Lecture 12
Note also that if the information matrix is block diagonal, then the score components for one parameter block cannot influence the MLE of the other block.
For notational ease: \[\Delta \hat{\boldsymbol{\theta}}_j = \hat{\boldsymbol{\theta}}(\mathbf{1}) - \hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)}).\]
(Collett 1994), citing (Hall, Rogers, and Pregibon 1982), suggests standardizing the sensitivity by the inverse of the variance-covariance matrix of \(\hat{\boldsymbol{\theta}}\), namely: \[(\Delta \hat{\boldsymbol{\theta}}_j)^T j(\hat{\boldsymbol{\theta}}) (\Delta \hat{\boldsymbol{\theta}}_j).\] Recall that \[j(\hat{\boldsymbol{\theta}}) = -\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\] and from the last lecture:
\[ \Delta \hat{\boldsymbol{\theta}}_j \approx \left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\right)^{-1} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i) \]
This leads to the tidy expression: \[\begin{align} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i)^T \left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\right)^{-1} \ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i). \end{align} \tag{1}\] Alternatively, we can use the sandwich estimator for the asymptotic variance/covariance matrix: \[\begin{align*} \hat{\Sigma}_{R} & = \left(-\frac{1}{n}\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\right)^{-1} \left(\frac{1}{n}\sum_{i=1}^n\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i)\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i)^T\right)\left(-\frac{1}{n}\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\right)^{-1} \end{align*}\] Note that this is the variance/covariance matrix for \(\sqrt{n}(\hat{\theta}_n - \theta^\dagger)\). We instead want the variance/covariance matrix for \(\hat{\theta}_n\) itself, so we divide \(\sqrt{n}(\hat{\theta}_n - \theta^\dagger)\) by \(\sqrt{n}\), which scales the variance estimate by \(n^{-1}\).
Using the statistic \[(\Delta \hat{\boldsymbol{\theta}}_j)^T (n^{-1}\hat{\Sigma}_{R})^{-1} (\Delta \hat{\boldsymbol{\theta}}_j)\] and noting the following equality: \[\begin{align*} (n^{-1} \hat{\Sigma}_{R})^{-1} & = -\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}) \left(\sum_{k=1}^n\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_k)\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_k)^T\right)^{-1}\left(-\ell_{\boldsymbol{\theta}\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}})\right) \end{align*}\] yields: \[\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i)^T\left(\sum_{k=1}^n\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_k)\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_k)^T\right)^{-1}\ell_\boldsymbol{\theta}(\hat{\boldsymbol{\theta}};\mathbf{y}_i).\] Let’s do an example where we can analytically calculate the influence score of a single observation on the parameter vector:
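As a numerical sanity check, here is a sketch (simulated data, hypothetical parameter values) for the two-group exponential model of the example below. It computes the per-observation influence statistic two ways: standardized by the observed information as in (1), and standardized by the sandwich estimator, which by the equality above reduces to the outer-product ("meat") matrix alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-group exponential data (hypothetical parameter values).
n = 40
z = rng.integers(0, 2, n)                      # group indicator
t = rng.exponential(1.0 / (0.5 * np.exp(0.7 * z)))
c = rng.exponential(2.0, n)                    # censoring times
delta = (t <= c).astype(float)                 # event indicator
t = np.minimum(t, c)

# Closed-form MLEs: lambda-hat = r1/T1, exp(beta-hat) = (r2/T2)(T1/r1).
r1, T1 = delta[z == 0].sum(), t[z == 0].sum()
r2, T2 = delta[z == 1].sum(), t[z == 1].sum()
lam, eb = r1 / T1, (r2 / T2) * (T1 / r1)

# Per-observation score vectors at the MLE (they sum to zero).
s = np.column_stack([delta / lam - eb**z * t,
                     delta * z - z * lam * eb**z * t])

# Observed information and the outer-product ("meat") matrix.
info = np.array([[(r1 + r2) / lam**2, eb * T2],
                 [eb * T2, lam * eb * T2]])
B = s.T @ s

# Statistic (1) vs. the sandwich-standardized version, per observation.
stat_info = np.einsum('ij,jk,ik->i', s, np.linalg.inv(info), s)
stat_sand = np.einsum('ij,jk,ik->i', s, np.linalg.inv(B), s)
```

Both versions are nonnegative quadratic forms in the score; they rank observations similarly but are not numerically identical.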
Example 1 (Influence of datapoints in an exponential regression model). The inverse of the observed information is: \[\begin{align} \hat{\mathcal{I}}^{-1}(\hat{\lambda}, \hat{\beta}) = \frac{1}{\frac{(r_1 + r_2)e^{\hat{\beta}} T_2}{\hat{\lambda}} - e^{2 \hat{\beta}} T_2^2}\begin{bmatrix} \hat{\lambda} e^{\hat{\beta}} T_2 & -e^{\hat{\beta}} T_2 \\ -e^{\hat{\beta}} T_2 & \frac{r_1 + r_2}{\hat{\lambda}^2} \end{bmatrix} \end{align}\] which, at the MLE, simplifies to: \[\begin{align} \begin{bmatrix} \frac{r_1}{T_1^2} & -\frac{1}{T_1}\\ -\frac{1}{T_1} & \frac{r_1 + r_2}{r_1 r_2} \end{bmatrix} \end{align}\] and the score equations are: \[\begin{align} \frac{\partial}{\partial \lambda} \ell(\lambda, \beta) & = \frac{\delta_i}{\lambda} - e^{z_i \beta} t_i\\ \frac{\partial}{\partial \beta} \ell(\lambda, \beta) & = \delta_i z_i - z_i \lambda e^{z_i \beta} t_i \end{align}\] which we evaluate at the MLE: \[\begin{align*} \hat{\lambda} & = \frac{r_1}{T_1} \\ e^{\hat{\beta}} & = \frac{r_2}{T_2}\frac{T_1}{r_1} \end{align*}\] to yield \[\begin{align} \frac{\partial}{\partial \lambda} \ell(\hat{\lambda}, \hat{\beta}) & = \frac{\delta_i T_1 }{r_1} - \left(\frac{r_2}{T_2}\frac{T_1}{r_1}\right)^{z_i} t_i\\ \frac{\partial}{\partial \beta} \ell(\hat{\lambda}, \hat{\beta}) & = \delta_i z_i -\left(\frac{r_2}{T_2}\right)z_i t_i \end{align}\]
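As a quick check on the algebra, the sketch below (with arbitrary hypothetical values for the group totals \(r_j\) and \(T_j\)) inverts the observed information numerically and compares it to the simplified closed form:

```python
import numpy as np

# Hypothetical group totals: r_j failures, T_j total time at risk.
r1, T1 = 7.0, 20.0
r2, T2 = 5.0, 12.0
lam, eb = r1 / T1, (r2 / T2) * (T1 / r1)       # closed-form MLEs

# Observed information at the MLE, inverted numerically.
info = np.array([[(r1 + r2) / lam**2, eb * T2],
                 [eb * T2, lam * eb * T2]])
inv_info = np.linalg.inv(info)

# The simplified closed form derived above.
simplified = np.array([[r1 / T1**2, -1.0 / T1],
                       [-1.0 / T1, (r1 + r2) / (r1 * r2)]])
```

The two matrices agree to floating-point precision.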
For an individual with \(z_i = 0\), this gives the sensitivities: \[\begin{align} \begin{bmatrix} \frac{\delta_i - t_i \frac{r_1}{T_1}}{T_1}\\ \frac{t_i}{T_1} - \frac{\delta_i}{r_1} \end{bmatrix} = \begin{bmatrix} \frac{\delta_i - t_i \frac{r_1}{T_1}}{T_1}\\ \frac{t_i \frac{r_1}{T_1} - \delta_i}{r_1} \end{bmatrix} \end{align}\] These expressions make sense. At a mathematical level, they agree with the total derivatives of \(\hat{\lambda}\) and \(\hat{\beta}\) viewed as the functions \(\frac{r_1}{T_1}\) and \(-\log(r_1 / T_1)\) of the group-1 totals. Our expression for the sensitivity of the MLE to the omission of one datapoint is the difference between the MLE of the full model and the MLE of the leave-one-observation-out model: \[\hat{\boldsymbol{\theta}}(\mathbf{1}) - \hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)})\] Omitting observation \(i\) therefore changes the total time at risk and the total failure count for its group \(j\) by the nonnegative amounts: \[\begin{align} T_j - (T_j)_{(i)} & = t_i \\ r_j - (r_j)_{(i)} & = \delta_i \end{align}\]
The total derivatives are then: \[\begin{align} \mathrm{d}\, \frac{r_1}{T_1} & = \frac{\partial}{\partial r_1} \frac{r_1}{T_1} \mathrm{d}r_1 + \frac{\partial}{\partial T_1} \frac{r_1}{T_1} \mathrm{d}T_1 \\ & = \frac{\mathrm{d}r_1}{T_1} - \mathrm{d}T_1\frac{r_1}{T_1^2} \\ & \approx \frac{\delta_i - t_i\frac{r_1}{T_1}}{T_1} \end{align}\] and \[\begin{align} \mathrm{d}\left(-\log(r_1 / T_1)\right) & = -\frac{\partial}{\partial r_1} \log(r_1 / T_1) \mathrm{d}r_1 - \frac{\partial}{\partial T_1} \log(r_1 / T_1) \mathrm{d}T_1 \\ & = \frac{\mathrm{d}T_1}{T_1} - \frac{\mathrm{d}r_1}{r_1} \\ & \approx \frac{t_i\frac{r_1}{T_1}-\delta_i}{r_1} \end{align}\] where the approximations substitute \(\mathrm{d}r_1 = \delta_i\) and \(\mathrm{d}T_1 = t_i\).
It helps to think about the units of the parameter estimates. \(\lambda\) measures the rate of failures per unit time, while \(\beta\) measures the log of the relative rates of failure; thus \(\beta\) is unitless. Remember that \[\delta_i - t_i \frac{r_j}{T_j}\] is the residual for an individual \(i\) in group \(j\). It compares the observed failure indicator to the expected number of failures, which in the exponential model is the estimated failure rate times the time at risk \(t_i\). When one removes an individual from group 1, the estimate of the failure rate in group 1 declines by the residual expected failures per unit time. At the same time, the log relative rate of failure must increase by the residual per failure, because the estimator for \(\beta\) is \(\log(r_2 / T_2) - \log(\hat{\lambda})\): any change in \(\hat{\lambda}\) induces an opposite change in \(\hat{\beta}\).
For an individual with \(z_i = 1\), the sensitivities are: \[\begin{align} \begin{bmatrix} 0\\ \frac{\delta_i}{r_2} - \frac{t_i}{T_2} \end{bmatrix} = \begin{bmatrix} 0\\ \frac{\delta_i - t_i \frac{r_2}{T_2}}{r_2} \end{bmatrix} \end{align}\] Again, this makes sense; \(\hat{\lambda} = \frac{r_1}{T_1}\), so omitting an individual in group \(2\) can’t change the MLE for \(\lambda\). Finally, given \(\hat{\beta} = \log(r_2 / T_2) - \log(\hat{\lambda})\), omitting observation \(i\) changes \(\log(r_2 / T_2)\), and hence \(\hat{\beta}\), by approximately the residual \(\delta_i - t_i \frac{r_2}{T_2}\) divided by the number of failures \(r_2\). Note the total derivative of \(\log(r_2 / T_2)\), as above, is: \[\begin{align} \mathrm{d}\, \log(r_2 / T_2) & = \frac{\mathrm{d}r_2}{r_2} - \frac{\mathrm{d}T_2}{T_2} \end{align}\]
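Because both MLEs have closed forms in the group totals, the leave-one-out refit is exact and cheap, so the first-order sensitivities can be checked directly. The sketch below (simulated data, hypothetical parameter values) compares the approximations above to the exact difference \(\hat{\boldsymbol{\theta}}(\mathbf{1}) - \hat{\boldsymbol{\theta}}(\mathbf{w}_{(j)})\) for every observation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated two-group exponential sample (hypothetical parameters).
n = 200
z = rng.integers(0, 2, n)
t = rng.exponential(1.0 / (0.4 * np.exp(0.5 * z)))
c = rng.exponential(3.0, n)                    # censoring times
delta = (t <= c).astype(float)
t = np.minimum(t, c)

def mle(t, delta, z):
    """Closed-form MLEs (lambda-hat, beta-hat) for the two-group model."""
    r1, T1 = delta[z == 0].sum(), t[z == 0].sum()
    r2, T2 = delta[z == 1].sum(), t[z == 1].sum()
    return np.array([r1 / T1, np.log(r2 / T2) - np.log(r1 / T1)])

full = mle(t, delta, z)
r1, T1 = delta[z == 0].sum(), t[z == 0].sum()
r2, T2 = delta[z == 1].sum(), t[z == 1].sum()

exact, approx = [], []
for i in range(n):
    keep = np.arange(n) != i
    exact.append(full - mle(t[keep], delta[keep], z[keep]))
    if z[i] == 0:
        approx.append([(delta[i] - t[i] * r1 / T1) / T1,
                       (t[i] * r1 / T1 - delta[i]) / r1])
    else:
        approx.append([0.0, (delta[i] - t[i] * r2 / T2) / r2])
exact, approx = np.array(exact), np.array(approx)

# Worst-case gap between the exact LOO change and the approximation.
err = np.abs(exact - approx).max()
```

As expected, the first component of the exact change is identically zero for group-2 individuals, and the first-order approximation tracks the exact leave-one-out differences closely.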
We can also calculate the scaled total deviation \((\Delta \hat{\boldsymbol{\theta}}_j)^T j(\hat{\boldsymbol{\theta}}) (\Delta \hat{\boldsymbol{\theta}}_j)\). For \(z_i = 0\) we have: \[\begin{align} \frac{\left(t_i \frac{r_1}{T_1} - \delta_i\right)^2}{r_1} \end{align}\] and for \(z_i = 1\) we have \[\begin{align} \frac{\left(t_i \frac{r_2}{T_2} - \delta_i\right)^2}{r_2} \end{align}\]
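A small sketch (hypothetical group totals and observations) can confirm that the quadratic form in the observed information reduces to these closed forms:

```python
import numpy as np

# Hypothetical group totals: r_j failures, T_j total time at risk.
r1, T1, r2, T2 = 7.0, 20.0, 5.0, 12.0
lam, eb = r1 / T1, (r2 / T2) * (T1 / r1)       # closed-form MLEs

# Observed information at the MLE.
info = np.array([[(r1 + r2) / lam**2, eb * T2],
                 [eb * T2, lam * eb * T2]])

def scaled_dev(t_i, delta_i, z_i):
    """Quadratic form d' info d vs. the closed form residual^2 / r_j."""
    if z_i == 0:
        a = delta_i - t_i * r1 / T1            # group-1 residual
        d = np.array([a / T1, -a / r1])        # sensitivity vector
        return d @ info @ d, a**2 / r1
    a = delta_i - t_i * r2 / T2                # group-2 residual
    d = np.array([0.0, a / r2])
    return d @ info @ d, a**2 / r2

quad0, closed0 = scaled_dev(2.5, 1.0, 0)       # a z_i = 0 observation
quad1, closed1 = scaled_dev(1.5, 0.0, 1)       # a z_i = 1 observation
```

The two evaluations match for both groups.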