Lecture 11
1 Influence of data points in likelihood equations
The material in this section is from (Collett 1994), (Cain and Lange 1984), and (Broderick, Giordano, and Meager 2023). As in linear regression, we’d like to determine whether some of our data points are driving our conclusions; armed with this information, perhaps we can expand the model to incorporate these outliers, or perhaps there is a data processing error that we can rectify before re-running our analysis.
One idea is to determine whether omitting one data point appreciably changes our estimate of the parameter of interest. The simplest way to do this is to refit the model \(n\) times, each time omitting one data point. For small datasets this is reasonable, but when \(n\) is large, or the model is very complex, it may be infeasible to refit the model \(n\) times.
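To make the brute-force approach concrete, here is a small sketch (the model, names, and simulated data are illustrative assumptions, not from the text): for an exponential survival model with administrative censoring, the MLE of the rate has a closed form, so refitting \(n\) times is a one-line loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy censored-exponential survival data (illustrative values):
# failure times with rate 0.7, administratively censored at time 2.0.
n = 200
failure = rng.exponential(1 / 0.7, size=n)
t = np.minimum(failure, 2.0)            # observed time (failure or censoring)
delta = (failure <= 2.0).astype(float)  # 1 if failure observed, 0 if censored

def exp_mle(t, delta):
    """Exponential-rate MLE: number of events / total time at risk."""
    return delta.sum() / t.sum()

theta_hat = exp_mle(t, delta)

# Brute-force leave-one-out: refit the model n times, dropping one point each time.
loo = np.array([exp_mle(np.delete(t, j), np.delete(delta, j)) for j in range(n)])
influence = theta_hat - loo  # change in the estimate from deleting point j
```

For \(n = 200\) this is instant, but each refit here is a closed-form ratio; with an iterative solver and large \(n\), the same loop becomes the bottleneck the approximation below avoids.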
Instead, we can cleverly use Taylor expansions to approximate the effect of small perturbations in the data on the estimated coefficient. If these small perturbations induce large changes in our estimated coefficients, then it stands to reason that the datapoints that have been perturbed are influential to our estimates.
Let’s make things more concrete. Suppose we have a model with a parameter vector \(\boldsymbol{\theta} \in \R^k\) and a maximum likelihood estimate thereof, \(\hat{\boldsymbol{\theta}}\). We’d like to understand how \(\hat{\boldsymbol{\theta}}\) changes if we drop one datapoint; let the index of this datapoint be \(j\). We can formalize the idea of dropping a datapoint by examining the score equations. Recall our typical problem setup: we have \(n\) observations, each of which is a triplet of the time to failure or the time to censoring, \(t_i\), an indicator \(\delta_i\) that \(t_i\) is the time to failure, and \(\mathbf{z}_i \in \R^k\), the covariate vector associated with each unit. Let this triplet of information associated with unit \(i\) be denoted \(\mathbf{y}_i = (t_i, \delta_i, \mathbf{z}_i)\). As in previous lectures, let the log-likelihood contribution of unit \(i\) be defined as: \[ \ell(\boldsymbol{\theta}; \mathbf{y}_i) \equiv \log f_{\boldsymbol{\theta}}(t_i, \delta_i, \mathbf{z}_i), \] and let \[\ell(\boldsymbol{\theta}) = \sum_{i=1}^n \ell(\boldsymbol{\theta}; \mathbf{y}_i).\] Recall the notation for the score: \[ \ell_\boldsymbol{\theta}(\boldsymbol{\theta}') = \left(\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})\right) \Big|_{\boldsymbol{\theta} = \boldsymbol{\theta}'}. \] Let \(\hat{\boldsymbol{\theta}}\) be the value of \(\boldsymbol{\theta}\) that solves the score equations \[\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}) = \mathbf{0}.\]
We now introduce a vector of observation weights, \(\mathbf{w} \in \R^n\), into the log-likelihood, \[ \ell(\boldsymbol{\theta},\mathbf{w}) = \sum_{i=1}^n w_i \ell(\boldsymbol{\theta}; \mathbf{y}_i), \] and by linearity of the gradient \[\begin{align} \ell_\boldsymbol{\theta}(\boldsymbol{\theta}, \mathbf{w}) = \sum_{i=1}^n w_i \ell_\boldsymbol{\theta}(\boldsymbol{\theta}; \mathbf{y}_i). \end{align}\] Now let \(\hat{\boldsymbol{\theta}}(\mathbf{w})\) denote the vector which solves the equations \[ \ell_\boldsymbol{\theta}(\boldsymbol{\theta}, \mathbf{w}) = \mathbf{0}. \]
Note that our original MLE is recovered at the all-ones weight vector \(\mathbf{1} = (1, \dots, 1)^T\), i.e., \(\hat{\boldsymbol{\theta}} \equiv \hat{\boldsymbol{\theta}}(\mathbf{1})\). This notation now gives us a way to delete the \(j^\mathrm{th}\) datapoint, namely by setting \(w_j = 0\).
We’ll then approximate \(\hat{\boldsymbol{\theta}}(\mathbf{w}')\) for \(\mathbf{w}'\) “near” the vector \(\mathbf{1}\) by a first-order Taylor expansion.
For the \(m^\mathrm{th}\) element of \(\hat{\boldsymbol{\theta}}(\mathbf{w})\), \(\hat{\boldsymbol{\theta}}(\mathbf{w})_m\), this is: \[\begin{align} \hat{\boldsymbol{\theta}}(\mathbf{w})_m \approx \hat{\boldsymbol{\theta}}(\mathbf{1})_m + \sum_{i=1}^n (w_i - 1) \left(\frac{\partial}{\partial w_i} \hat{\boldsymbol{\theta}}(\mathbf{w})_m\right)\Bigg|_{\mathbf{w} = \mathbf{1}} \end{align}\] Now set \(w_j = 0\) and all other entries to \(1\), i.e., \[\mathbf{w}_{(j)} = (\mathbf{1}_{j-1}^T, 0, \mathbf{1}_{n - j}^T)^T.\] Then we get \[\begin{align} \hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)})_m \approx \hat{\boldsymbol{\theta}}(\mathbf{1})_m - \left(\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})_m\right)\Bigg|_{\mathbf{w} = \mathbf{1}} \end{align}.\] For the whole vector \(\hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)})\) we get, as in (Cain and Lange 1984), \[\begin{align} \hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)}) \approx \hat{\boldsymbol{\theta}}(\mathbf{1}) - \left(\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})\right)\Bigg|_{\mathbf{w} = \mathbf{1}} \end{align}\] where \[\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w}) = \left(\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})_1, \dots, \frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})_k\right)^T.\] The question remains: how do we calculate \(\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})\) evaluated at \(\mathbf{w}=\mathbf{1}\)?
Note that the score equations are a function of the parameter vector and the vector of weights. The MLE given a set of weights \(\mathbf{w}\), \(\hat{\boldsymbol{\theta}}(\mathbf{w})\), solves the system of equations: \[\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{w}), \mathbf{w}) = \mathbf{0}.\] The implicit function theorem then allows us to differentiate the expression above with respect to \(w_j\) and solve for the derivative of interest, \(\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})\).
Recall the chain rule for multivariate functions: Let \(g(t) = v(x(t), y(t))\) and calculate \(\frac{\partial}{\partial t} g(t)\): \[ \begin{aligned} \frac{\partial}{\partial t} v(x(t), y(t)) & = \frac{\partial v(x, y)}{\partial x}\Bigg|_{x=x(t), y=y(t)} \frac{\partial x(u)}{\partial u}\Bigg|_{u = t} \\ & + \frac{\partial v(x, y)}{\partial y}\Bigg|_{x = x(t), y = y(t)} \frac{\partial y(u)}{\partial u}\Bigg|_{u = t}\end{aligned}.\] We can differentiate the expression for the score function: \[\begin{align*} \frac{\partial}{\partial w_j} \mathbf{0} & = \frac{\partial}{\partial w_j} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{w}), \mathbf{w}) \\ \mathbf{0} & = \frac{\partial \ell_\boldsymbol{\theta}(\boldsymbol{\theta},\mathbf{w})}{\partial \boldsymbol{\theta}^T}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}(\mathbf{1}), \mathbf{w}=\mathbf{1}} \frac{\partial \hat{\boldsymbol{\theta}}(\mathbf{w})}{\partial w_j}\Bigg|_{\mathbf{w}=\mathbf{1}} + \frac{\partial \ell_\boldsymbol{\theta}(\boldsymbol{\theta}, \mathbf{w})}{\partial w_j}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}(\mathbf{1}), \mathbf{w}=\mathbf{1}} \\ \mathbf{0} & = \ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}),\mathbf{1}) \frac{\partial \hat{\boldsymbol{\theta}}(\mathbf{w})}{\partial w_j}\Bigg|_{\mathbf{w}=\mathbf{1}} + \frac{\partial \ell_\boldsymbol{\theta}(\boldsymbol{\theta}, \mathbf{w})}{\partial w_j}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}(\mathbf{1}), \mathbf{w}=\mathbf{1}} \end{align*}\] Assuming that the Hessian of the log-likelihood for the complete dataset, \[\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}),\mathbf{1}),\] whose negative is the observed information matrix, is invertible, we can solve the equation for the quantity of interest, \(\frac{\partial}{\partial w_j} \hat{\boldsymbol{\theta}}(\mathbf{w})\) evaluated at \(\mathbf{w}=\mathbf{1}\). 
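The implicit-function-theorem derivative can be verified numerically. In the sketch below (an assumed censored-exponential model, not from the text), the weighted MLE has the closed form \(\hat{\theta}(\mathbf{w}) = \sum_i w_i \delta_i / \sum_i w_i t_i\), so the analytic derivative from the display above can be checked against a finite difference in \(w_j\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed censored-exponential model: weighted log-likelihood
#   l(theta, w) = sum_i w_i * (delta_i * log(theta) - theta * t_i),
# whose weighted score equation has the closed-form root below.
n = 50
failure = rng.exponential(1.0, size=n)
t = np.minimum(failure, 2.0)
delta = (failure <= 2.0).astype(float)

def theta_hat(w):
    return (w * delta).sum() / (w * t).sum()

theta1 = theta_hat(np.ones(n))

# Implicit-function-theorem derivative at w = 1:
#   d theta_hat / d w_j = (-l_thetatheta)^(-1) * l_theta(theta_hat; y_j)
info = delta.sum() / theta1**2          # -l_thetatheta at the full-data MLE
j = 3
analytic = (delta[j] / theta1 - t[j]) / info

# Finite-difference check of the same derivative
eps = 1e-6
w = np.ones(n)
w[j] += eps
numeric = (theta_hat(w) - theta1) / eps
```

The two quantities agree to finite-difference accuracy, which is exactly the statement that differentiating the score equations implicitly recovers the sensitivity of the solution path \(\hat{\theta}(\mathbf{w})\).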
Note that, since only the \(j^\mathrm{th}\) term of the weighted score depends on \(w_j\), \[ \frac{\partial \ell_\boldsymbol{\theta}(\boldsymbol{\theta}, \mathbf{w})}{\partial w_j}\Bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}(\mathbf{1}), \mathbf{w}=\mathbf{1}} = \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1}); \mathbf{y}_j), \] so
\[\begin{align*} \frac{\partial \hat{\boldsymbol{\theta}}(\mathbf{w})}{\partial w_j}\Bigg|_{\mathbf{w}=\mathbf{1}} = \left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}),\mathbf{1})\right)^{-1} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1}); \mathbf{y}_j) \end{align*}\] Finally, we get the general equation for the sensitivity of the MLE to the deletion of the \(j^\mathrm{th}\) data point: \[\begin{align} \hat{\boldsymbol{\theta}}(\mathbf{1}) - \hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)}) \approx \left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\right)^{-1} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_j) \end{align} \tag{1}\] This makes a good bit of sense: if the gradient of the log-likelihood at a datapoint lies along a direction of large uncertainty (one in which the observed information is small), that datapoint will have a large influence on the MLE.
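Equation 1 can be sanity-checked numerically. The sketch below (an assumed censored-exponential model, chosen because its scores and observed information have closed forms) compares the one-step approximation against exact leave-one-out deletion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed censored-exponential data; per-observation log-likelihood is
#   l(theta; y_i) = delta_i * log(theta) - theta * t_i.
n = 200
failure = rng.exponential(1 / 0.7, size=n)
t = np.minimum(failure, 2.0)
delta = (failure <= 2.0).astype(float)

theta_hat = delta.sum() / t.sum()       # MLE solving the score equation

scores = delta / theta_hat - t          # l_theta(theta_hat; y_i) for each i
info = delta.sum() / theta_hat**2       # observed information, -l_thetatheta

# Equation 1: theta_hat(1) - theta_hat(w_(j)) ~ info^(-1) * score_j
approx_influence = scores / info

# Exact leave-one-out deletions for comparison
loo = np.array([(delta.sum() - delta[j]) / (t.sum() - t[j]) for j in range(n)])
exact_influence = theta_hat - loo
```

With \(n = 200\) the approximation tracks the exact deletions closely; the discrepancy is the second-order Taylor remainder, which shrinks as \(n\) grows.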
The expression in Equation 1 also makes sense when viewed through the lens of the limiting distribution of the MLE. Recall from a previous lecture that a Taylor expansion of the score \(\ell_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\) around the true parameter \(\boldsymbol{\theta}^\dagger\), using the Taylor expansion formula with remainders, yields \[\begin{align*} \sqrt{n}(\hat{\boldsymbol{\theta}}(\mathbf{1}) - \boldsymbol{\theta}^\dagger) & = \left(-\frac{1}{n} \ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\boldsymbol{\theta}^\dagger) \right)^{-1} \frac{1}{\sqrt{n}} \ell_\boldsymbol{\theta}(\boldsymbol{\theta}^\dagger) + o_p(1). \end{align*}\] The plug-in estimator for the right-hand side at \(\boldsymbol{\theta}^\dagger = \hat{\boldsymbol{\theta}}(\mathbf{1})\) yields: \[\begin{align*} \sqrt{n}(\hat{\boldsymbol{\theta}}(\mathbf{1}) - \boldsymbol{\theta}^\dagger) & = \left(-\frac{1}{n} \ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}))\right)^{-1} \frac{1}{\sqrt{n}} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1})) + o_p(1) \\ & = \left(-\frac{1}{n} \ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}))\right)^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1}); \mathbf{y}_i) + o_p(1) \end{align*}\] Dividing each side by \(\sqrt{n}\) yields \[\begin{align*} \hat{\boldsymbol{\theta}}(\mathbf{1}) - \boldsymbol{\theta}^\dagger = \left(-\frac{1}{n} \ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}))\right)^{-1} \frac{1}{n} \sum_{i=1}^n\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1}); \mathbf{y}_i) + o_p(1/\sqrt{n}). \end{align*}\] Thus, asymptotically, each observation \((t_i, \delta_i, \mathbf{z}_i)\) perturbs the deviation between the MLE and the true value by approximately: \[\begin{align*} \left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}))\right)^{-1} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1}); \mathbf{y}_i). \end{align*}\]
1.0.1 Linear regression
In the case of the linear regression model with normally distributed errors and known variance (taken to be \(1\) for simplicity), we have the following results: \[\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}}(\mathbf{1}); \mathbf{y}_i) = \mathbf{z}_i (y_i - \mathbf{z}_i^T \hat{\boldsymbol{\beta}})\] and \[\left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}(\mathbf{1}))\right)^{-1} = (\mathbf{Z}^T \mathbf{Z})^{-1}.\] Assuming that \(\mathbf{Z}\) has full column rank, we can decompose this variance-covariance matrix as: \[(\mathbf{Z}^T \mathbf{Z})^{-1} = \mathbf{Q} \mathbf{A} \mathbf{Q}^T\] where \(\mathbf{Q}\) is a matrix with the orthonormal eigenvectors of \(\mathbf{Z}^T \mathbf{Z}\) as columns and \(\mathbf{A}\) is the diagonal matrix of reciprocals of the corresponding eigenvalues. Thus the influence of the \(j^\mathrm{th}\) datapoint is \[\mathbf{Q} \mathbf{A} \mathbf{Q}^T \mathbf{z}_j (y_j - \mathbf{z}_j^T \hat{\boldsymbol{\beta}}).\] If \(\mathbf{z}_j\) lies in a direction of large uncertainty for the variance-covariance matrix (i.e., the vector is aligned with an eigenvector associated with a large eigenvalue of \((\mathbf{Z}^T \mathbf{Z})^{-1}\)), and there is a large fitted residual, the datapoint will have a lot of influence on at least one of the coefficients.
Compare this to the exact calculation of the influence on \(\hat{\boldsymbol{\beta}}\) from removing the \(j^\mathrm{th}\) point in a linear regression model: \[\hat{\boldsymbol{\beta}}(\mathbf{w}_{(j)}) - \hat{\boldsymbol{\beta}}(\mathbf{1}) = \frac{-\mathbf{Q} \mathbf{A} \mathbf{Q}^T \mathbf{z}_j (y_j - \mathbf{z}_j^T \hat{\boldsymbol{\beta}})}{1 - \mathbf{z}_j^T\mathbf{Q} \mathbf{A} \mathbf{Q}^T \mathbf{z}_j},\] where the denominator involves the leverage \(h_j = \mathbf{z}_j^T (\mathbf{Z}^T \mathbf{Z})^{-1} \mathbf{z}_j\); the first-order approximation in Equation 1 simply drops the factor \(1/(1 - h_j)\).
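The two formulas can be compared numerically. The sketch below (simulated data; all names and values are illustrative assumptions) computes the first-order influence from Equation 1, the exact leverage-corrected deletion, and a brute-force refit for one point.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated linear model with unit error variance (illustrative setup).
n, k = 100, 3
Z = rng.normal(size=(n, k))
beta_true = np.array([1.0, -2.0, 0.5])
y = Z @ beta_true + rng.normal(size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
resid = y - Z @ beta_hat
ZtZ_inv = np.linalg.inv(Z.T @ Z)

# First-order approximation: beta_hat(1) - beta_hat(w_(j)) ~ (Z^T Z)^(-1) z_j e_j;
# row j of `approx` is the approximate influence of point j.
approx = (Z * resid[:, None]) @ ZtZ_inv

# Exact deletion via the leverage h_j = z_j^T (Z^T Z)^(-1) z_j
h = np.einsum("ij,jk,ik->i", Z, ZtZ_inv, Z)
exact = approx / (1 - h)[:, None]

# Brute-force refit without one point, for comparison against the exact formula
j = 0
Zd, yd = np.delete(Z, j, axis=0), np.delete(y, j)
beta_drop = np.linalg.solve(Zd.T @ Zd, Zd.T @ yd)
```

The exact formula matches the brute-force refit to machine precision, while the first-order approximation differs from it by the factor \(1/(1 - h_j)\), which is negligible when all leverages are small.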