Lecture 9

For many of the prior examples, a convenient estimator for the Fisher information is the average of the observed information. The observed information is just the negative of the matrix of second derivatives of the log-likelihood: \[\begin{align} -\ell_{\theta\theta}(\theta) = -\begin{bmatrix} \ddtA{\theta_1}\ell(\theta) & \ddtB{\theta_1}{\theta_2}\ell(\theta) & \dots & \ddtB{\theta_1}{\theta_p}\ell(\theta) \\ \ddtB{\theta_2}{\theta_1} \ell(\theta) & \ddtA{\theta_2} \ell(\theta) & \dots & \ddtB{\theta_2}{\theta_p}\ell(\theta) \\ \vdots & \vdots & \ddots & \vdots \\ \ddtB{\theta_p}{\theta_1} \ell(\theta) & \ddtB{\theta_p}{\theta_2} \ell(\theta) & \dots & \ddtA{\theta_p}\ell(\theta) \\ \end{bmatrix} \end{align}\] This is often denoted as \[j(\theta) \equiv -\ell_{\theta\theta}(\theta).\] Substituting \(\ell(\theta) = \sum_i \log f_\theta(X_i)\) and using the fact that differentiation is a linear operator, \[\begin{align} j(\theta) = -\sum_i \begin{bmatrix} \ddtA{\theta_1}\log f_\theta(X_i) & \ddtB{\theta_1}{\theta_2}\log f_\theta(X_i) & \dots & \ddtB{\theta_1}{\theta_p}\log f_\theta(X_i) \\ \ddtB{\theta_2}{\theta_1} \log f_\theta(X_i) & \ddtA{\theta_2} \log f_\theta(X_i) & \dots & \ddtB{\theta_2}{\theta_p}\log f_\theta(X_i) \\ \vdots & \vdots & \ddots & \vdots \\ \ddtB{\theta_p}{\theta_1} \log f_\theta(X_i) & \ddtB{\theta_p}{\theta_2} \log f_\theta(X_i) & \dots & \ddtA{\theta_p}\log f_\theta(X_i) \\ \end{bmatrix} \end{align}\] we can see that the natural estimator of \(\mathcal{I}(\theta)\) is the average observed information, which does indeed converge in probability to the Fisher information: \[\frac{1}{n} j(\theta) \overset{p}{\to} \mathcal{I}(\theta).\] Of course, we typically won’t know \(\theta\) (unless we’re evaluating \(j(\theta)\) at a hypothesized value \(\theta_0\)), so we use the plug-in estimator \(j(\hat{\theta}_n)\), which still converges in probability to the Fisher information: \[\frac{1}{n} j(\hat{\theta}_n) \overset{p}{\to} \mathcal{I}(\theta).\]
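
As a quick sanity check, here is a small simulation sketch (the exponential model and all numbers are assumptions for illustration, not from the notes) showing that the average observed information evaluated at the MLE lands close to the Fisher information:

```python
import numpy as np

# For an Exponential(lambda) sample, log f_lambda(x) = log(lambda) - lambda * x,
# so the observed information is j(lambda) = n / lambda^2 and the Fisher
# information is I(lambda) = 1 / lambda^2 (a standard textbook fact).
rng = np.random.default_rng(0)
true_lambda, n = 2.0, 5_000
x = rng.exponential(scale=1 / true_lambda, size=n)

lambda_hat = 1 / x.mean()               # MLE of the exponential rate
avg_obs_info = (n / lambda_hat**2) / n  # (1/n) * j(lambda_hat)
fisher_info = 1 / true_lambda**2        # I(true_lambda)

print(avg_obs_info, fisher_info)        # the two numbers should be close
```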

1 Tests in terms of observed information

When we use observed information in place of the Fisher information, the Wald and Score tests look a bit different:

1.1 Wald test with the observed information

\[n (\hat{\theta}_n - \theta_0)^T \frac{1}{n}j(\hat{\theta}_n) (\hat{\theta}_n - \theta_0) = (\hat{\theta}_n - \theta_0)^T j(\hat{\theta}_n) (\hat{\theta}_n - \theta_0) \overset{\text{asympt.}}{\sim} \chi^2_p\]
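
A minimal sketch of computing this statistic, assuming we already have the MLE and its observed information in hand (the names `theta_hat`, `theta_0`, and `j_hat` and all the numbers below are placeholders, not from any example in the notes):

```python
import numpy as np
from scipy import stats

# Placeholder MLE, null value, and observed information (negative Hessian of
# the log-likelihood evaluated at the MLE).
theta_hat = np.array([0.52, 1.37])
theta_0 = np.array([0.50, 1.00])
j_hat = np.array([[40.0, 5.0],
                  [5.0, 25.0]])

diff = theta_hat - theta_0
T_W = diff @ j_hat @ diff                        # (theta_hat - theta_0)^T j (theta_hat - theta_0)
p_value = stats.chi2.sf(T_W, df=len(theta_hat))  # compare to a chi-squared with p degrees of freedom
print(T_W, p_value)
```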

1.2 Score test with the observed information

\[\begin{align*} T_S & = \left(\frac{1}{\sqrt{n}} \nabla_\theta \ell(\theta) \mid_{\theta = \hat{\theta}_0} \right)^T \left(\frac{1}{n}j(\hat{\theta}_0)\right)^{-1}\frac{1}{\sqrt{n}} \nabla_\theta \ell(\theta) \mid_{\theta = \hat{\theta}_0} \\ & = \left(\frac{1}{\sqrt{n}} \nabla_\theta \ell(\theta) \mid_{\theta = \hat{\theta}_0} \right)^T n(j(\hat{\theta}_0))^{-1}\frac{1}{\sqrt{n}} \nabla_\theta \ell(\theta) \mid_{\theta = \hat{\theta}_0} \\ & = \left(\nabla_\theta \ell(\theta) \mid_{\theta = \hat{\theta}_0} \right)^T j(\hat{\theta}_0)^{-1}\nabla_\theta \ell(\theta) \mid_{\theta = \hat{\theta}_0} \end{align*}\] which is again asymptotically distributed as \(\chi^2_p\) under the null hypothesis.
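
A matching sketch for the score version, with placeholder values for the gradient and observed information evaluated at the null value:

```python
import numpy as np
from scipy import stats

score = np.array([1.8, -0.6])    # gradient of the log-likelihood at theta_0 (placeholder)
j_null = np.array([[40.0, 5.0],  # j(theta_0), also a placeholder
                   [5.0, 25.0]])

T_S = score @ np.linalg.solve(j_null, score)  # score^T j(theta_0)^{-1} score
print(T_S, stats.chi2.sf(T_S, df=len(score)))
```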

2 Composite tests

This section is an expansion of Appendix B in (Klein, Moeschberger, et al. 2003).

We can modify all of our tests to accommodate testing a subset of the parameters. Typically we’ll have a subset of the parameter vector, let’s call it \(\psi\), that we’re interested in, and another subset, \(\phi\), of nuisance parameters. In our prior exponential regression example, we’ll likely be interested in testing whether \(\beta = 0\), and we won’t care about testing \(\lambda\).

Let’s let \(\theta=(\psi, \phi)\), and let \(\theta \in \R^p\) so \(\psi\in\R^k\), \(k < p\), \(\phi\in\R^{p-k}\). Our null hypothesis will be: \[H_0: \psi = \psi_0.\] Let \(\hat{\phi}(\psi_0)\) be the MLE for the nuisance parameter with \(\psi\) fixed at \(\psi_0\) under the null hypothesis. We’ll also partition the information matrix into a 2 by 2 block matrix: \[\mathcal{I}(\psi, \phi) = \begin{bmatrix} \Exp{-\nabla^2_\psi \log f_\theta(X_1)} & \Exp{-\nabla^2_{\psi, \phi} \log f_\theta(X_1)} \\ \Exp{-\nabla^2_{\psi, \phi} \log f_\theta(X_1)}^T & \Exp{-\nabla^2_\phi \log f_\theta(X_1)} \end{bmatrix} = \begin{bmatrix} \mathcal{I}_{\psi,\psi} & \mathcal{I}_{\psi,\phi} \\ \mathcal{I}_{\psi,\phi}^T & \mathcal{I}_{\phi,\phi} \end{bmatrix}\] The observed information matrix can be partitioned the same way: \[j(\psi^\prime, \phi^\prime) = \begin{bmatrix} -\ell_{\psi\psi}(\psi^\prime, \phi^\prime) & -\ell_{\psi\phi}(\psi^\prime, \phi^\prime) \\ -\ell_{\psi\phi}(\psi^\prime, \phi^\prime)^T & -\ell_{\phi\phi}(\psi^\prime, \phi^\prime) \end{bmatrix} = \begin{bmatrix} j_{\psi,\psi}(\psi^\prime,\phi^\prime) & j_{\psi,\phi}(\psi^\prime,\phi^\prime) \\ j_{\psi,\phi}(\psi^\prime,\phi^\prime)^T & j_{\phi,\phi}(\psi^\prime,\phi^\prime) \end{bmatrix}\]

The inverse can also be partitioned into a 2 by 2 block matrix: \[\mathcal{I}(\psi, \phi)^{-1} = \begin{bmatrix} \mathcal{I}^{\psi,\psi} & \mathcal{I}^{\psi,\phi} \\ \left(\mathcal{I}^{\psi,\phi}\right)^T & \mathcal{I}^{\phi,\phi} \end{bmatrix}\] The expression for \(\mathcal{I}^{\psi,\psi}\) can be found from the block matrix inversion formula: \[\begin{align} \mathcal{I}^{\psi,\psi} & = \mathcal{I}_{\psi,\psi}^{-1} + \mathcal{I}_{\psi,\psi}^{-1}\mathcal{I}_{\psi,\phi}\left(\mathcal{I}_{\phi,\phi} -\mathcal{I}_{\psi,\phi}^T\mathcal{I}_{\psi,\psi}^{-1} \mathcal{I}_{\psi,\phi}\right)^{-1}\mathcal{I}_{\psi,\phi}^T\mathcal{I}_{\psi,\psi}^{-1} \\ & = \left(\mathcal{I}_{\psi,\psi} - \mathcal{I}_{\psi,\phi} \mathcal{I}_{\phi,\phi}^{-1}\mathcal{I}_{\psi,\phi}^T\right)^{-1} \end{align} \tag{1}\]
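
Equation 1 is a standard block-inverse identity, so it is easy to check numerically; here is a small check on an arbitrary symmetric positive definite matrix standing in for \(\mathcal{I}(\psi, \phi)\) (the dimensions and the matrix itself are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 2, 5
A = rng.normal(size=(p, p))
info = A @ A.T + p * np.eye(p)   # symmetric positive definite stand-in for I(psi, phi)

I_pp = info[:k, :k]              # I_{psi,psi}
I_pf = info[:k, k:]              # I_{psi,phi}
I_ff = info[k:, k:]              # I_{phi,phi}

upper_left_of_inverse = np.linalg.inv(info)[:k, :k]                        # I^{psi,psi}
schur_inverse = np.linalg.inv(I_pp - I_pf @ np.linalg.inv(I_ff) @ I_pf.T)  # Equation 1

print(np.allclose(upper_left_of_inverse, schur_inverse))                   # True
```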

All of the same notation will be used for the observed information, \(j(\psi, \phi)\):

\[j(\psi^\prime, \phi^\prime)^{-1} = \begin{bmatrix} j^{\psi,\psi}(\psi^\prime,\phi^\prime) & j^{\psi,\phi}(\psi^\prime,\phi^\prime) \\ j^{\psi,\phi}(\psi^\prime,\phi^\prime)^T & j^{\phi,\phi}(\psi^\prime,\phi^\prime) \end{bmatrix}\]

2.1 Composite Wald test

Again using normal distribution theory, the asymptotic distribution of \(\hat{\psi}_n\) is: \[\sqrt{n}(\hat{\psi}_n - \psi_0) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}^{\psi,\psi}).\] The Wald test statistic is then: \[\begin{align*} T_W = \sqrt{n}(\hat{\psi}_n - \psi_0) ^T \left(\mathcal{I}^{\psi,\psi}\right)^{-1}(\hat{\psi}_n - \psi_0)\sqrt{n} \end{align*}\] where \(\mathcal{I}^{\psi,\psi}\) is evaluated at the true parameter value. Substituting the observed information for the Fisher information (the factors of \(n\) cancel as before), we get \[\begin{align} T_W = (\hat{\psi}_n - \psi_0) ^T \left(j^{\psi,\psi}(\hat{\psi}_n, \hat{\phi}_n)\right)^{-1}(\hat{\psi}_n - \psi_0) \overset{d}{\to} \chi^2_k \end{align}\] This simplifies by plugging in the Schur complement of \(j_{\phi,\phi}(\hat{\psi}_n, \hat{\phi}_n)\), namely \(j_{\psi,\psi}(\hat{\psi}_n, \hat{\phi}_n) - j_{\psi,\phi}(\hat{\psi}_n, \hat{\phi}_n) j_{\phi,\phi}(\hat{\psi}_n, \hat{\phi}_n)^{-1}j_{\psi,\phi}(\hat{\psi}_n, \hat{\phi}_n)^T\), which by Equation 1 is the inverse of \(j^{\psi,\psi}(\hat{\psi}_n, \hat{\phi}_n)\): \[ T_W = (\hat{\psi}_n - \psi_0) ^T \left(j_{\psi,\psi}(\hat{\psi}_n, \hat{\phi}_n) - j_{\psi,\phi}(\hat{\psi}_n, \hat{\phi}_n) j_{\phi,\phi}(\hat{\psi}_n, \hat{\phi}_n)^{-1}j_{\psi,\phi}(\hat{\psi}_n, \hat{\phi}_n)^T\right)(\hat{\psi}_n - \psi_0) \]
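
A sketch of this computation, with \(\psi\) taken to be the first \(k\) coordinates of \(\theta\); the MLE and observed information below are placeholder values:

```python
import numpy as np
from scipy import stats

k = 1
psi_hat, psi_0 = np.array([0.35]), np.array([0.0])
j_hat = np.array([[30.0, 4.0, 2.0],   # j(psi_hat, phi_hat), partitioned as
                  [4.0, 50.0, 6.0],   # [[j_pp, j_pf], [j_pf^T, j_ff]]
                  [2.0, 6.0, 45.0]])

j_pp, j_pf, j_ff = j_hat[:k, :k], j_hat[:k, k:], j_hat[k:, k:]
schur = j_pp - j_pf @ np.linalg.solve(j_ff, j_pf.T)   # inverse of j^{psi,psi}

diff = psi_hat - psi_0
T_W = diff @ schur @ diff
print(T_W, stats.chi2.sf(T_W, df=k))                  # compare to a chi^2_k
```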

2.2 Composite Score test

The composite score test is a bit more complicated. The score vector, evaluated at the true parameter value under the null, is asymptotically jointly normal: \[\sqrt{n} \frac{1}{n} \ell_{\theta}(\psi_0, \phi) \overset{d}{\to} \mathcal{N}\left(0, \begin{bmatrix} \mathcal{I}_{\psi,\psi} & \mathcal{I}_{\psi,\phi} \\ \mathcal{I}_{\psi,\phi}^T & \mathcal{I}_{\phi,\phi} \end{bmatrix}\right)\] But when we have a nuisance parameter, under the null we solve the score equations \[\ell_{\phi}(\psi_0, \phi) = 0,\] leading to an MLE for \(\phi\), \(\hat{\phi}(\psi_0)\), that depends on \(\psi_0\). This means the distribution of \(\sqrt{n} \frac{1}{n} \ell_{\psi}(\psi_0, \hat{\phi}(\psi_0))\) needs to condition on the score equations for \(\phi\) being zero. If the score vector is asymptotically normally distributed, then the score for \(\psi\) is conditionally normal given the score for \(\phi\). Recall that if vectors \(X, Y\) are multivariate normal with marginal variance-covariance matrices \(\Sigma_X, \Sigma_Y\) and \(\Sigma_{X,Y}\) is the covariance matrix of \(X\) with \(Y\), then \(X \mid Y\) is multivariate normal with mean and variance \[\Exp{X} + \Sigma_{X,Y} \Sigma_Y^{-1}(Y - \Exp{Y}),\quad \Sigma_X - \Sigma_{X,Y} \Sigma_Y^{-1}\Sigma_{X,Y}^T.\] In our case, the marginal mean of the score equations is zero, and \(Y \equiv \ell_{\phi}(\psi_0, \hat{\phi}(\psi_0))\) is zero, so the conditional distribution of the score for \(\psi\) is \[\sqrt{n}\frac{1}{n} \ell_{\psi}(\psi_0, \hat{\phi}(\psi_0)) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}_{\psi,\psi} - \mathcal{I}_{\psi,\phi} \mathcal{I}_{\phi,\phi}^{-1}\mathcal{I}_{\psi,\phi}^T).\] The test statistic is then \[\begin{align*} n^{-1/2} \ell_{\psi}(\psi_0, \hat{\phi}(\psi_0))^T\left(\mathcal{I}_{\psi,\psi} - \mathcal{I}_{\psi,\phi} \mathcal{I}_{\phi,\phi}^{-1}\mathcal{I}_{\psi,\phi}^T\right)^{-1} n^{-1/2} \ell_{\psi}(\psi_0, \hat{\phi}(\psi_0)) \end{align*}\] As we showed in Equation 1, the inverse matrix is the same as \(\mathcal{I}^{\psi,\psi}\), so, subbing in the observed information matrix again, we get the final form \[\begin{align*} T_S = \ell_{\psi}(\psi_0, \hat{\phi}(\psi_0))^T j^{\psi, \psi}(\psi_0, \hat{\phi}(\psi_0)) \ell_{\psi}(\psi_0, \hat{\phi}(\psi_0)) \end{align*}\] which is asymptotically distributed as \(\chi^2_k\).
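
A matching sketch for the composite score statistic; everything is evaluated at \((\psi_0, \hat{\phi}(\psi_0))\), where the \(\phi\) block of the score is exactly zero, and the numbers below are placeholders:

```python
import numpy as np
from scipy import stats

k = 1
score_psi = np.array([2.1])            # l_psi(psi_0, phi_hat(psi_0)), placeholder
j_null = np.array([[30.0, 4.0, 2.0],   # j(psi_0, phi_hat(psi_0)), placeholder
                   [4.0, 50.0, 6.0],
                   [2.0, 6.0, 45.0]])

j_pp, j_pf, j_ff = j_null[:k, :k], j_null[:k, k:], j_null[k:, k:]
j_upper_psi = np.linalg.inv(j_pp - j_pf @ np.linalg.solve(j_ff, j_pf.T))  # j^{psi,psi}

T_S = score_psi @ j_upper_psi @ score_psi
print(T_S, stats.chi2.sf(T_S, df=k))   # compare to a chi^2_k
```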

2.3 Composite likelihood ratio test

The composite likelihood ratio test is similar to the likelihood ratio test: \[T_{LR} = 2(\ell(\hat{\psi},\hat{\phi}) - \ell(\psi_0,\hat{\phi}(\psi_0)))\] and this is again asymptotically distributed as \(\chi^2_k\) under the null hypothesis.

Example 1 (relative risk example, continued). Suppose we are interested in testing the hypothesis \(H_0: \beta = 0\) vs \(H_a: \beta \neq 0\).

Recall the definitions of \(r_1, r_2, T_1, T_2\): \[\begin{alignat*} {2} r_1 & = \sum_{i=1}^n (1 - z_i) \delta_i\quad && T_1 = \sum_{i=1}^n (1 - z_i) t_i \\ r_2 & = \sum_{i=1}^n z_i \delta_i\quad && T_2 = \sum_{i=1}^n z_i t_i \end{alignat*}\] We showed in our prior example that the log-likelihood was: \[\begin{align} \ell(\lambda, \beta) = (r_1 + r_2)\log\lambda -\lambda T_1 + r_2 \beta -\lambda e^\beta T_2 \end{align}\] The score equations are \[\begin{align*} \frac{\partial}{\partial \lambda} \ell(\lambda, \beta) & = \frac{r_1 + r_2}{\lambda} - T_1 - e^\beta T_2 \\ \frac{\partial}{\partial \beta} \ell(\lambda, \beta) & = r_2 - \lambda e^\beta T_2 \end{align*}\] and the negative of the matrix of second derivatives of the log-likelihood with respect to \(\lambda, \beta\), i.e., the observed information, is \[\begin{align} j(\lambda, \beta) = -\nabla^2_{\lambda, \beta} \ell(\lambda, \beta) = \begin{bmatrix} \frac{r_1 + r_2}{\lambda^2} & e^\beta T_2 \\ e^\beta T_2 & \lambda e^\beta T_2 \end{bmatrix} \end{align}\] The unrestricted MLE (i.e., the MLE under the alternative hypothesis) is: \[\begin{align*} \hat{\lambda} & = \frac{r_1}{T_1} \\ \hat{e^\beta} & = \frac{r_2}{T_2}\frac{T_1}{r_1} \end{align*}\] Under the null hypothesis that \(\beta = 0\), we have the restricted log-likelihood: \[\begin{align} \ell(\lambda, \beta=0) = (r_1 + r_2)\log\lambda -\lambda T_1 -\lambda T_2 \end{align}\] which can be differentiated with respect to \(\lambda\), set to zero, and solved for \(\lambda\): \[\begin{align} \hat{\lambda}_0 & = \frac{r_1 + r_2}{T_1 + T_2} \end{align}\] The \((2,2)\) element of the inverse of the observed information evaluated at the unrestricted MLE was shown to be \[\begin{align} \frac{r_1 + r_2}{r_1 r_2} \end{align}\] More generally, the inverse of the observed information is: \[\begin{align} j(\lambda, \beta)^{-1} = \frac{1}{\frac{(r_1 + r_2)e^\beta T_2}{\lambda} - e^{2 \beta} T_2^2}\begin{bmatrix} \lambda e^\beta T_2 & -e^\beta T_2 \\ -e^\beta T_2 & \frac{r_1 + r_2}{\lambda^2} \end{bmatrix} \end{align}\] and evaluating the \((2,2)\) element at \((\hat{\lambda}_0, 0)\) gives \[\left(j(\hat{\lambda}_0, 0)^{-1}\right)_{2,2} = \frac{(T_1 + T_2)^2}{(r_1 + r_2) T_1 T_2}\] Now for the test statistics:

  • Likelihood ratio test: After some algebra, we get \[T_{LR} = 2 r_1 \left(\log\left(\frac{r_1}{T_1}\right)- \log\left(\frac{r_1 + r_2}{T_1 + T_2}\right)\right)+ 2 r_2 \left(\log\left(\frac{r_2}{T_2}\right)- \log\left(\frac{r_1 + r_2}{T_1 + T_2}\right)\right)\]

  • Wald test: The test statistic is: \[T_W = \left(\log\frac{r_2 / T_2}{r_1/T_1}\right)^2 \frac{r_1 r_2}{r_1 + r_2}.\]

  • Score test: The test statistic, before simplification, is: \[T_S = \left(r_2 - (r_1+r_2)\frac{T_2 }{T_1 + T_2} \right)^2 \frac{(T_1 + T_2)^2}{(r_1 + r_2) T_1 T_2}.\] This is sort of interesting because it looks a bit like the log-rank statistic! \(\frac{T_2}{T_1 + T_2}\) is a bit like the proportion of time at risk the second group experienced, and the expected total failures in the second group is this proportion multiplied by the total failures in both groups. It’s not too hard to see why you might want to reject the null that \(\beta=0\) if this statistic were large. This simplifies to \[T_S = \frac{(T_1 r_2 - T_2 r_1)^2}{(r_1 + r_2)T_1 T_2 }.\]

For an observed dataset of \(r_1 = 10, r_2 = 12, T_1 = 25, T_2 = 27\), all three statistics are approximately \(0.06\), far below the critical value of \(3.84\), the \(95^\mathrm{th}\) percentile of a \(\chi^2_1\) distribution, so we fail to reject the null hypothesis.
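
These numbers are easy to verify directly from the formulas above (`scipy` is used only for the \(\chi^2_1\) critical value):

```python
import numpy as np
from scipy import stats

# Observed data from the example.
r1, r2, T1, T2 = 10.0, 12.0, 25.0, 27.0

lam_hat = r1 / T1                     # unrestricted MLE of lambda
exp_beta_hat = (r2 / T2) * (T1 / r1)  # unrestricted MLE of e^beta
lam_0 = (r1 + r2) / (T1 + T2)         # restricted MLE under beta = 0

T_LR = (2 * r1 * (np.log(r1 / T1) - np.log(lam_0))
        + 2 * r2 * (np.log(r2 / T2) - np.log(lam_0)))
T_W = np.log(exp_beta_hat) ** 2 * r1 * r2 / (r1 + r2)
T_S = (T1 * r2 - T2 * r1) ** 2 / ((r1 + r2) * T1 * T2)

crit = stats.chi2.ppf(0.95, df=1)     # about 3.84
print(round(T_LR, 3), round(T_W, 3), round(T_S, 3), round(crit, 2))
# All three statistics are about 0.06, well below 3.84, so we fail to reject H0.
```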

References

Klein, John P., Melvin L. Moeschberger, et al. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer.