Lecture 8

1 Example continued

In the preceding example, we shied away from using the Fisher information because \(T_2\) was not easily accessible. But we can use the results from the earlier exponential MLE example to derive an exact expression for the asymptotic sampling variance of the MLE.

1.1 Fisher information

Continued example. This is an expansion of the example in (Collett 1994). The second partial derivatives of the log-likelihood are: \[\begin{align} \frac{\partial}{\partial \lambda}\left(\frac{\partial}{\partial \lambda}\ell(\lambda, \psi)\right)& = -\frac{r_1 + r_2}{\lambda^2} \\ \frac{\partial}{\partial \beta}\left(\frac{\partial}{\partial \lambda}\ell(\lambda, \psi)\right)& = -e^\beta T_2 \\ \frac{\partial}{\partial \beta}\left(\frac{\partial}{\partial \beta}\ell(\lambda, \psi)\right)& = -\lambda e^\beta T_2 \end{align}\]
We know that \[\Exp{r_1} = n_1\Exp{1 - e^{-\lambda C_i}}{C_i}, \, \Exp{r_2} = n_2\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i}, \textrm{and} \, \Exp{T_2} = n_2 \frac{1}{\lambda e^\beta} \Exp{1 - e^{-\lambda e^\beta C_i}}{C_i}\] Then the Fisher information is \[\begin{align} \begin{bmatrix} \frac{n_1\Exp{1 - e^{-\lambda C_i}}{C_i} + n_2\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i}}{\lambda^2} & \frac{1}{\lambda} n_2\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i} \\ \frac{1}{\lambda} n_2\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i} & n_2\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i} \end{bmatrix} \end{align}\] Let \(\Exp{r_{i1}} = \Exp{1 - e^{-\lambda C_i}}{C_i}\) and \(\Exp{r_{i2}} = \Exp{1 - e^{-\lambda e^\beta C_i}}{C_i}\). We know the asymptotic variance of the MLE is the inverse of the Fisher information matrix. The inverse is: \[\begin{align} \frac{\lambda^2}{n_{1} n_{2} \Exp{r_{i1}}\Exp{r_{i2}}} \begin{bmatrix} n_2 \Exp{r_{i2}} & -n_2\Exp{r_{i2}}/\lambda \\ -n_2\Exp{r_{i2}}/\lambda & \frac{n_1\Exp{r_{i1}} + n_2\Exp{r_{i2}}}{\lambda^2} \end{bmatrix} = \begin{bmatrix} \frac{\lambda^2}{n_1\Exp{r_{i1}}} & -\frac{\lambda}{n_1 \Exp{r_{i1}}} \\ -\frac{\lambda}{n_1 \Exp{r_{i1}}} & \frac{n_1\Exp{r_{i1}} + n_2\Exp{r_{i2}}}{n_1 n_2 \Exp{r_{i1}}\Exp{r_{i2}}} \end{bmatrix} \end{align}\] So the asymptotic standard error for \(\beta\) is \[\sqrt{\frac{n_1\Exp{1 - e^{-\lambda C_i}}{C_i} + n_2\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i}}{n_1 n_2\Exp{1 - e^{-\lambda C_i}}{C_i}\Exp{1 - e^{-\lambda e^\beta C_i}}{C_i}}}\]
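As a numerical sanity check on this formula, the sketch below (not part of the original example) simulates the two-group exponential model with a fixed censoring time \(C\), so that \(\Exp{1 - e^{-\lambda C_i}}{C_i} = 1 - e^{-\lambda C}\), and compares the asymptotic standard error above with the Monte Carlo standard deviation of \(\hat{\beta}\). The closed-form MLEs \(\hat{\lambda} = r_1/T_1\) and \(\hat{\lambda}e^{\hat{\beta}} = r_2/T_2\) follow from setting the score equations to zero; all parameter values below are arbitrary choices for illustration.

```python
# Monte Carlo sanity check of the asymptotic SE for beta_hat in the
# two-group exponential model. Sketch only: assumes a *fixed* censoring time C,
# so E[1 - exp(-lambda C_i)] = 1 - exp(-lambda C).
import numpy as np

rng = np.random.default_rng(0)
lam, beta, C = 0.5, 0.7, 2.0            # true rate, log hazard ratio, censoring time
n1, n2, n_sim = 200, 200, 5000

p1 = 1 - np.exp(-lam * C)               # P(event observed) in group 1
p2 = 1 - np.exp(-lam * np.exp(beta) * C)
se_asymptotic = np.sqrt((n1 * p1 + n2 * p2) / (n1 * n2 * p1 * p2))
# equivalently sqrt(1/(n1*p1) + 1/(n2*p2)): one over the expected event counts

beta_hats = []
for _ in range(n_sim):
    t1 = rng.exponential(1 / lam, n1)                   # group 1 event times
    t2 = rng.exponential(1 / (lam * np.exp(beta)), n2)  # group 2 event times
    r1, T1 = np.sum(t1 <= C), np.sum(np.minimum(t1, C))  # events, total exposure
    r2, T2 = np.sum(t2 <= C), np.sum(np.minimum(t2, C))
    beta_hats.append(np.log((r2 / T2) / (r1 / T1)))     # beta_hat = log hazard ratio

print("asymptotic SE :", se_asymptotic)
print("Monte Carlo SE:", np.std(beta_hats))
```

The two printed values should be close for these sample sizes.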

2 Asymptotic interlude

As you’ve already no doubt gathered, many of the results for inference and hypothesis testing in survival analysis rely on asymptotic normality of the MLE. Before we get too much further into the quarter, I thought it would be a good idea to review the asymptotic results for maximum likelihood. This outline of results is from (Keener 2010).

2.1 Sketch of asymptotic normality of MLE

Let \(X_i, i = 1, 2, \dots\) be distributed \(i.i.d.\) with density \(f_\theta\) where \(\theta \in \R^p\). We suppose that the support of \(X_i\) does not depend on \(\theta\), and that our MLEs are consistent for \(\theta\). This is pretty mild, and only requires that likelihood ratios are integrable and our model is identifiable.

Let the log-likelihood be denoted \(\ell(\theta) = \sum_{i=1}^n \log f_{X}(x_i; \theta)\), in which we suppress the dependence of the likelihood on the data. Through an abuse of notation we let \(\ell(\theta; x_i) \equiv \log f_{X}(x_i; \theta)\). We denote the gradient of the log-likelihood with respect to \(\theta\) evaluated at \(\theta^\prime\) as: \[ \ell_{\theta}(\theta^\prime) \equiv \sum_{i=1}^n \nabla_\theta \log f_X(x_i; \theta)\mid_{\theta = \theta^\prime},\,\ell_{\theta}(\theta^\prime; x_i) \equiv \nabla_\theta \log f_X(x_i; \theta)\mid_{\theta = \theta^\prime} \] The Hessian of the log-likelihood (i.e. the matrix of second derivatives of the log-likelihood) is: \[ \ell_{\theta\theta}(\theta^\prime) \equiv \sum_{i=1}^n \nabla^2_\theta \log f_X(x_i; \theta)\mid_{\theta = \theta^\prime},\,\ell_{\theta\theta}(\theta^\prime;x_i) \equiv \nabla^2_\theta \log f_X(x_i; \theta)\mid_{\theta = \theta^\prime} \] Given that \(\theta \in \R^p\), we’ll denote an element of the vector \(\ell_\theta(\theta^\prime)\) as \((\ell_\theta(\theta^\prime))_j\) and a row of \(\ell_{\theta\theta}(\theta^\prime)\) as \((\ell_{\theta\theta}(\theta^\prime))_j\). Finally, let \((\ell_{\theta\theta\theta}(\theta^\prime))_j = \nabla^2_\theta (\ell_{\theta}(\theta^\prime))_j\), which is a \(p \times p\) matrix.
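To make this notation concrete, here is a small numerical sketch (my own illustration, not from the notes): for a Normal location-scale model parameterized by \((\mu, \log\sigma)\), the analytic gradient \(\ell_\theta(\theta)\), i.e. the sum of per-observation scores, is compared against a central finite-difference approximation. The model, parameter values, and function names (`loglik`, `score`) are purely illustrative.

```python
import numpy as np

def loglik(theta, x):
    # log-likelihood of a Normal(mu, sigma^2) sample, parameterized by (mu, log sigma)
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return np.sum(-0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2)

def score(theta, x):
    # analytic gradient l_theta(theta): sum of per-observation scores
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    z = (x - mu) / sigma
    return np.array([np.sum(z) / sigma, np.sum(z**2 - 1)])

rng = np.random.default_rng(1)
x = rng.normal(1.0, 2.0, size=500)
theta = np.array([0.8, np.log(1.9)])

eps = 1e-5
finite_diff = np.array([(loglik(theta + eps * e, x) - loglik(theta - eps * e, x)) / (2 * eps)
                        for e in np.eye(2)])
print(score(theta, x))   # analytic gradient
print(finite_diff)       # should agree to several decimal places
```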

Further assumptions will be needed:

  1. The score \(\ell_\theta(\cdot;x)\) is twice continuously differentiable in \(\theta\) (equivalently, the log-likelihood is three times continuously differentiable)

  2. For \(\theta \in N(\theta^\dagger)\), a neighborhood of the true parameter \(\theta^\dagger\), \(\sup_{\theta \in N(\theta^\dagger)}\norm{\ell_{\theta\theta\theta}(\theta; x)} \leq g(x)\), where \(g(x)\) is integrable with respect to \(f_{\theta^\dagger}(x)\,dx\)

  3. \(\Exp{-\ell_{\theta\theta}(\theta^\dagger; X_i)}\) is invertible

We can expand each dimension of the gradient of the log-likelihood evaluated at the MLE, \(\ell_\theta(\hat{\theta}_n)\), around the true parameter value \(\theta^\dagger\) in a two-term Taylor expansion: \[ \begin{align*} (\ell_\theta(\hat{\theta}_n))_j = (\ell_\theta(\theta^\dagger))_j + (\ell_{\theta\theta}(\theta^\dagger))_j (\hat{\theta}_n - \theta^\dagger) + \frac{1}{2}(\hat{\theta}_n - \theta^\dagger)^T (\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j(\hat{\theta}_n - \theta^\dagger) \end{align*} \] where \(\tilde{\theta}_{n,j}\) is a point on the chord between \(\hat{\theta}_n\) and \(\theta^\dagger\) and may depend on the coordinate \(j\).

Noting that \((\ell_\theta(\hat{\theta}_n))_j = 0\) for all \(j\), we get the set of \(p\) linear equations: \[\begin{align*} (\ell_\theta(\theta^\dagger))_j = -(\ell_{\theta\theta}(\theta^\dagger))_j (\hat{\theta}_n - \theta^\dagger) - \frac{1}{2}(\hat{\theta}_n - \theta^\dagger)^T (\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j(\hat{\theta}_n - \theta^\dagger) \end{align*}\] Multiplying both sides by \(\frac{\sqrt{n}}{n}\) gives: \[ \begin{align*} \sqrt{n}\lp\frac{1}{n} (\ell_\theta(\theta^\dagger))_j\rp & = -\left(\frac{1}{n}(\ell_{\theta\theta}(\theta^\dagger))_j\right) \sqrt{n}(\hat{\theta}_n - \theta^\dagger) - \frac{1}{2}(\hat{\theta}_n - \theta^\dagger)^T \lp \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j\rp\sqrt{n}(\hat{\theta}_n - \theta^\dagger) \\ & = \lp -\left(\frac{1}{n}(\ell_{\theta\theta}(\theta^\dagger))_j\right) - \frac{1}{2}(\hat{\theta}_n - \theta^\dagger)^T \lp \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j\rp\rp\sqrt{n}(\hat{\theta}_n - \theta^\dagger) \end{align*} \] The term \(\frac{1}{n}(\ell_{\theta\theta}(\theta^\dagger))_j \overset{p}{\to} \Exp{(\ell_{\theta\theta}(\theta^\dagger; X_i))_j}\) by the WLLN, while \[ \frac{1}{2}(\hat{\theta}_n - \theta^\dagger)^T \lp \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j\rp \overset{p}{\to} 0 \] by the WLLN, the boundedness condition on the third-order derivatives, and the consistency of our estimator.

To be precise, we want to show that \[ \lim_{n\to\infty}\Prob{\norm{(\hat{\theta}_n - \theta^\dagger)^T \lp \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j\rp} > \epsilon}{} = 0 \] \[ \begin{aligned} \Prob{\norm{(\hat{\theta}_n - \theta^\dagger)^T \lp \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j\rp} > \epsilon}{} & \leq \Prob{\norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{ \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j} > \epsilon}{} \\ & \leq \Prob{\norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{\frac{1}{n}\sum_{i=1}^n g(X_i)} > \epsilon}{} \\ & \leq \Prob{\norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{\frac{1}{n}\sum_{i=1}^n g(X_i) - \Exp{g(X_1)}} > \frac{\epsilon}{2}}{} \\ & \quad + \Prob{\norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{\Exp{g(X_1)}} > \frac{\epsilon}{2}}{} \end{aligned} \] The last line follows from a trick used in a proof in § 6.3 of Resnick (2019). If the following event holds

\[ \begin{aligned} & \left\{\norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{\frac{1}{n}\sum_{i=1}^n g(X_i) - \Exp{g(X_1)}} \leq \frac{\epsilon}{2}\right\} \bigcap\\ & \left\{\norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{\Exp{g(X_1)}} \leq \frac{\epsilon}{2}\right\} \end{aligned} \]

Then the triangle inequality implies that \[ \norm{(\hat{\theta}_n - \theta^\dagger)^T} \norm{\frac{1}{n}\sum_{i=1}^n g(X_i)} \leq \epsilon. \] Taking complements and using Boole’s inequality (\(P(A \cup B) \leq P(A) + P(B)\)) yields the final inequality. Both terms converge to zero due to the convergence in probability of the MLE, and the WLLN applied to the empirical average of \(g(X_i)\).

By Slutsky’s theorem, \[ -\left(\frac{1}{n}(\ell_{\theta\theta}(\theta^\dagger))_j\right) - \frac{1}{2}(\hat{\theta}_n - \theta^\dagger)^T \lp \frac{1}{n}(\ell_{\theta\theta\theta}(\tilde{\theta}_{n,j}))_j\rp \overset{p}{\to}-\Exp{(\ell_{\theta\theta}(\theta^\dagger; X_i))_j} \] Collecting our \(p\) equations into one set of equations yields: \[ \sqrt{n}\lp\frac{1}{n} \ell_\theta(\theta^\dagger)\rp = (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)} + o_p(1)) \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \tag{1}\]

Writing out the expressions as explicit sums: \[\begin{align*} \frac{\sqrt{n}}{n} \sum_{i=1}^n \ell_\theta(\theta^\dagger; x_i) = (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)} + o_p(1)) \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \end{align*} \tag{2}\]

The left-hand side of Equation 2 will be amenable to a multivariate version of the CLT. We’ll take the following multivariate CLT as given:

Theorem 1 (Multivariate CLT, (Keener 2010)). Let \(X_1, X_2, \dots\) be i.i.d. random vectors in \(\R^k\) with a common mean \(\Exp{X_i} = \mu\) and common covariance matrix \(\Sigma = \Exp{(X_i - \mu)(X_i - \mu)^T}\). If \(\bar{X} = \frac{\sum_{i=1}^n X_i}{n}\), then \[\sqrt{n}(\bar{X} - \mu) \overset{d}{\to} \text{Normal}(0, \Sigma)\]

Recall that \[ \Exp{\ell_\theta(\theta^\dagger; X_i)} = 0 \] By the multivariate central limit theorem (MCLT), the left-hand side of Equation 2 converges in distribution to a multivariate normal distribution with mean zero and covariance matrix \(\Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}\).
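As a quick illustration of the two facts just used (zero-mean scores and the MCLT), the sketch below, which is my own and assumes an Exponential(\(\lambda\)) model where \(\ell_\theta(\lambda; x) = 1/\lambda - x\) and \(\mathcal{I}(\lambda) = 1/\lambda^2\), checks by simulation that \(\sqrt{n}\lp\frac{1}{n}\ell_\theta(\lambda)\rp\) has mean near zero and variance near \(\mathcal{I}(\lambda)\).

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, n_rep = 1.5, 400, 2000
scaled_scores = []
for _ in range(n_rep):
    x = rng.exponential(1 / lam, n)
    # sqrt(n) * average per-observation score, evaluated at the true rate
    scaled_scores.append(np.sqrt(n) * np.mean(1 / lam - x))

print("mean (should be near 0)          :", np.mean(scaled_scores))
print("variance (should be near 1/lam^2):", np.var(scaled_scores), 1 / lam**2)
```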

We’ll also need a lemma about the solutions to random linear equations:

Lemma 1 (Lemma 5.2 in (Lehmann and Casella 1998)) Suppose we have a set of \(p\) equations, \(j = 1, \dots, p\): \[\sum_{k=1}^p A_{jkn} Y_{kn} = T_{jn}.\] Let \(T_{1n}, \dots, T_{pn}\) converge in distribution to \(T_1, \dots, T_p\). Furthermore, suppose that for each \(j,k\), \(A_{jkn} \overset{p}{\to} a_{jk}\) such that the matrix \(A\) with \((j,k)^{\mathrm{th}}\) element \(a_{jk}\) is nonsingular. Then if \((T_1, \dots, T_p)\) has a density with respect to the Lebesgue measure over \(\R^p\), \(Y_{1n}, \dots, Y_{pn}\) tend in probability to \(A^{-1} T\).

Written in matrix form and using the fact that convergence in probability implies convergence in distribution: \[ T_n = A_n Y_n \implies Y_n \overset{d}{\to} A^{-1}T \]

We have that the left-hand side of Equation 1 converges in distribution to a multivariate normal distribution, and we have that the matrix on the RHS of Equation 1 converges in probability to \(-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)}\), which by assumption is invertible. Thus by Lemma 1, \(\sqrt{n}(\hat{\theta}_n - \theta^\dagger)\) converges in probability to \[\begin{align*} \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{p}{\to} (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1} \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}^{1/2}\mathcal{Z} \end{align*}\] where \(\mathcal{Z} \sim \text{Normal}(0, I_p)\), or \[\begin{align*} &\sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{d}{\to} \mathcal{N}\left(0, (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1} \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}(-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1}\right) \end{align*}\]

Assuming further that \[ \Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)} + \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}=0 \implies (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1}\Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T} = I_p \] Putting this all together shows that \[\begin{align*} \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta^\dagger)^{-1}) \end{align*}\] where \(\mathcal{I}(\theta^\dagger) = -\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)}\) is the Fisher information for a single observation.
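A matching simulation (again a sketch under the same illustrative Exponential(\(\lambda\)) model, where \(\hat{\lambda} = 1/\bar{X}\) and \(\mathcal{I}(\lambda)^{-1} = \lambda^2\)) checks the final statement: the empirical variance of \(\sqrt{n}(\hat{\lambda} - \lambda)\) should be close to \(\lambda^2\).

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, n_rep = 1.5, 400, 2000
# sqrt(n) * (lambda_hat - lambda) across replicated samples
z = [np.sqrt(n) * (1 / np.mean(rng.exponential(1 / lam, n)) - lam) for _ in range(n_rep)]
print("empirical variance       :", np.var(z))
print("I(lambda)^{-1} = lambda^2:", lam**2)
```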

2.2 Estimators of variance-covariance matrix

In the previous section, we encountered two consistent estimators of the asymptotic variance-covariance matrix: \[\begin{align*} \lp-\frac{1}{n} \ell_{\theta\theta}(\hat{\theta}_n)\rp^{-1} & \overset{p}{\to} \mathcal{I}(\theta^\dagger)^{-1} \\ \lp\frac{1}{n} \sum_{i=1}^n \ell_{\theta}(\hat{\theta}_n; x_i)\ell_{\theta}(\hat{\theta}_n; x_i)^T\rp^{-1} & \overset{p}{\to} \mathcal{I}(\theta^\dagger)^{-1} \end{align*}\] These expressions assume that our inferential model matches the data generating model. In the event our inferential model is different from the true data generating model, it can be shown that the scaled MLE converges in distribution to \[\begin{align*} &\sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{d}{\to} \mathcal{N}\left(0, (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1} \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}(-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1}\right) \end{align*}\] where the key difference is that \(\theta^\dagger\) is no longer the parameter of the true data generating process, but is instead the parameter that minimizes the KL divergence between the assumed inferential model and the true distribution generating the data.

Thus, the following sandwich estimator for the variance-covariance matrix is often preferred over either of the above expressions: \[\begin{align} \hat{\Sigma}_{R} & = (-\Prob{\ell_{\theta\theta}(\hat{\theta}_n; X)}{n})^{-1} \Prob{\ell_\theta(\hat{\theta}_n;X)\ell_\theta(\hat{\theta}_n;X)^T}{n}(-\Prob{\ell_{\theta\theta}(\hat{\theta}_n; X)}{n})^{-1} \overset{p}{\to} \text{Var}\left(\sqrt{n}(\hat{\theta}_n - \theta^\dagger)\right) \end{align}\] where \(\hat{\theta}_n\) is the MLE.
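The sketch below (my own example, not from the notes) computes all three estimates for a deliberately misspecified fit: an Exponential(\(\lambda\)) working model fit to Gamma data. For this working model the per-observation score at \(\lambda\) is \(1/\lambda - x\) and the per-observation Hessian is \(-1/\lambda^2\); the shape and scale values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.0, scale=1.0, size=2000)   # true model is NOT exponential
n = x.size
lam_hat = 1 / np.mean(x)                          # exponential MLE

scores = 1 / lam_hat - x                          # per-observation scores at the MLE
A = 1 / lam_hat**2                                # -(1/n) * Hessian of the log-likelihood
B = np.mean(scores**2)                            # (1/n) * sum of score outer products

inv_hessian_estimate = 1 / A                      # (-(1/n) l_thetatheta)^{-1}
outer_score_estimate = 1 / B                      # ((1/n) sum l_theta l_theta^T)^{-1}
sandwich_estimate = (1 / A) * B * (1 / A)         # robust / sandwich estimate

print(inv_hessian_estimate, outer_score_estimate, sandwich_estimate)
# Under misspecification the first two disagree; the sandwich combination is the one
# that tracks Var(sqrt(n)(theta_hat - theta_dagger)).
```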

2.3 Asymptotic confidence intervals

For the most part, we’ll be concerned with univariate confidence intervals, but in multiparameter models like the Weibull we’ll need to compute the full inverse of the Fisher information. WLOG, let the index of the parameter of interest be \(1\), so the asymptotic variance of our MLE for the parameter of interest is \(\sigma_1^2(\theta^\dagger) = \mathcal{I}(\theta^\dagger)^{-1}_{1,1}\). We can also define \[\sigma_1^2(\hat{\theta}) = \mathcal{I}(\hat{\theta})^{-1}_{1,1}.\] I’ll also ditch the \(n\) subscript and just let \(\hat{\theta}\) be our MLE based on \(n\) observations. By the continuous mapping theorem, \[\frac{\sigma_1^2(\hat{\theta})}{\sigma_1^2(\theta^\dagger)} \overset{p}{\to} 1.\] This allows us to use the plug-in estimator \(\mathcal{I}(\hat{\theta})^{-1}\) in place of \(\mathcal{I}(\theta^\dagger)^{-1}\): \[\begin{align*} \frac{\sqrt{n}(\hat{\theta}_1 - \theta_1^\dagger)}{\sigma_1(\hat{\theta})} & = \frac{\sigma_1(\theta^\dagger)}{\sigma_1(\hat{\theta})}\frac{\sqrt{n}(\hat{\theta}_1 - \theta_1^\dagger)}{\sigma_1(\theta^\dagger)} \\ & \overset{d}{\to} \mathcal{N}(0, 1) \end{align*}\] Using Slutsky’s theorem for the product, we can create an asymptotic confidence interval by noting that \[P\left(\frac{\sqrt{n}(\hat{\theta}_1 - \theta^\dagger_1)}{\sigma_1(\hat{\theta})} \leq x\right) \to \Phi(x),\] where \(\Phi(x)\) is the CDF of a normal distribution with zero mean and unit variance.

Then \[\begin{align*} P\left(\frac{\sqrt{n}(\hat{\theta}_1 - \theta^\dagger_1)}{\sigma_1(\hat{\theta})} \in (-z_{1-\alpha/2},z_{1-\alpha/2})\right) & = P\left(\theta_1^\dagger \in \left(\hat{\theta}_1 - z_{1-\alpha/2} \frac{\sigma_1(\hat{\theta})}{\sqrt{n}}, \hat{\theta}_1 + z_{1-\alpha/2} \frac{\sigma_1(\hat{\theta})}{\sqrt{n}}\right)\right) \end{align*}\] and both sides converge to \(1-\alpha\), so \(\hat{\theta}_1 \pm z_{1-\alpha/2}\,\sigma_1(\hat{\theta})/\sqrt{n}\) is an asymptotic \(100(1-\alpha)\%\) confidence interval for \(\theta_1^\dagger\).
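A minimal sketch of the resulting interval, assuming the illustrative Exponential(\(\lambda\)) model from the earlier code blocks (so \(\mathcal{I}(\lambda) = 1/\lambda^2\) and \(\sigma_1(\hat{\theta}) = \hat{\lambda}\)); the sample size, rate, and \(\alpha\) are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam, n, alpha = 1.5, 200, 0.05
x = rng.exponential(1 / lam, n)
lam_hat = 1 / np.mean(x)                    # exponential MLE

z = stats.norm.ppf(1 - alpha / 2)           # z_{1 - alpha/2}
half_width = z * lam_hat / np.sqrt(n)       # z * sigma_1(theta_hat) / sqrt(n)
print((lam_hat - half_width, lam_hat + half_width))
```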

2.4 Asymptotic tests

2.4.1 Wald test

The Wald test is derived directly from the asymptotic distribution of the MLE. Under the null hypothesis \(\theta^\dagger = \theta_0\), \[\begin{align*} \sqrt{n}(\hat{\theta}_n - \theta_0) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1}) \end{align*}\] so the Wald test statistic satisfies \[T_W = n (\hat{\theta}_n - \theta_0)^T \mathcal{I}(\theta_0) (\hat{\theta}_n - \theta_0) \overset{d}{\to} \chi^2(p)\] This follows from the simple fact that if a random vector \(Z \in \R^k\) is distributed multivariate normal, \(Z \sim \mathcal{N}(0, \Sigma)\), then \(\Sigma^{-1/2} Z \sim \mathcal{N}(0, I)\), so \(Z^T\Sigma^{-1}Z = (\Sigma^{-1/2}Z)^T(\Sigma^{-1/2}Z) = \sum_{i=1}^k W_i^2\) where \(W_i \sim \mathcal{N}(0,1)\), which is a \(\chi^2(k)\) random variable.
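A scalar sketch under the same illustrative Exponential(\(\lambda\)) model, testing \(H_0: \lambda = \lambda_0\). Here the information is plugged in at the MLE, \(\mathcal{I}(\hat{\lambda}) = 1/\hat{\lambda}^2\); using \(\mathcal{I}(\lambda_0)\) as in the display above works the same way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
lam0, n = 1.0, 300
x = rng.exponential(1 / 1.2, n)                    # data generated with rate 1.2
lam_hat = 1 / np.mean(x)

T_wald = n * (lam_hat - lam0) ** 2 / lam_hat**2    # n (theta_hat - theta_0)^2 I(theta_hat)
p_value = stats.chi2.sf(T_wald, df=1)
print(T_wald, p_value)
```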

2.4.2 Rao’s score test

In our proof of the asymptotic distribution of the MLE, we used the fact that \[\sqrt{n}\frac{1}{n} \sum_{i=1}^n \ell_\theta(\theta^\dagger;X_i) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta^\dagger)).\] This idea can be used to derive Rao’s score test, which uses the fact that under \(H_0: \theta \in \Theta_0\), the gradient evaluated at the restricted MLE (i.e. the MLE restricted to the parameter space \(\Theta_0\)) is nearly zero, and the appropriately scaled score recovers a similar limiting distribution.

Assuming that under the null distribution the restricted MLE \(\hat{\theta}_0\) is consistent for \(\theta^\dagger \in \Theta_0\), then \[\sqrt{n} \lp\frac{1}{n}\ell_\theta(\hat{\theta}_0)\rp \overset{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta^\dagger))\] The score test statistic is: \[T_S = \lp\sqrt{n}\lp\frac{1}{n}\ell_\theta(\hat{\theta}_0)\rp \rp^T \mathcal{I}(\hat{\theta}_0)^{-1}\lp \sqrt{n}\lp\frac{1}{n}\ell_\theta(\hat{\theta}_0)\rp\rp = \frac{1}{n}\ell_\theta(\hat{\theta}_0)^T \mathcal{I}(\hat{\theta}_0)^{-1}\ell_\theta(\hat{\theta}_0)\] This test statistic is asymptotically distributed \(\chi^2(p)\) under \(H_0\).
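A sketch of the score test for the simple null \(H_0: \lambda = \lambda_0\) in the same illustrative exponential model, where the restricted MLE is just \(\lambda_0\), the full-sample score is \(\ell_\theta(\lambda_0) = n/\lambda_0 - \sum_i x_i\), and \(\mathcal{I}(\lambda_0) = 1/\lambda_0^2\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
lam0, n = 1.0, 300
x = rng.exponential(1 / 1.2, n)                    # data generated with rate 1.2

score = n / lam0 - np.sum(x)                       # total score at the null value
T_score = (score**2 / n) * lam0**2                 # (1/n) * score^2 * I(lambda_0)^{-1}
p_value = stats.chi2.sf(T_score, df=1)
print(T_score, p_value)
```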

2.4.3 Likelihood ratio test

The LRT comes from a second-order Taylor expansion of the log-likelihood itself around the MLE, as opposed to the expansion of the score used earlier: \[\begin{align*} -\ell(\theta_0) & = -\ell(\hat{\theta}_n) - \ell_\theta(\hat{\theta}_n)^T(\theta_0 - \hat{\theta}_n) - \frac{1}{2}(\hat{\theta}_n-\theta_0)^T\ell_{\theta\theta}(\tilde{\theta}_n)(\hat{\theta}_n-\theta_0) \\ \ell(\hat{\theta}_n) - \ell(\theta_0) & = -\frac{1}{2}(\hat{\theta}_n-\theta_0)^T\ell_{\theta\theta}(\tilde{\theta}_n)(\hat{\theta}_n-\theta_0) \\ & = \frac{1}{2}(\sqrt{n}(\hat{\theta}_n-\theta_0))^T\frac{-\ell_{\theta\theta}(\tilde{\theta}_n)}{n}(\sqrt{n}(\hat{\theta}_n-\theta_0)) \end{align*}\] where the second line uses the fact that \(\ell_\theta(\hat{\theta}_n) = 0\). As before, under \(H_0\), \[\sqrt{n}(\hat{\theta}_n-\theta_0) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1})\] and \[-\frac{\ell_{\theta\theta}(\tilde{\theta}_n)}{n} \overset{p}{\to} \mathcal{I}(\theta_0)\] so \[2 (\ell(\hat{\theta}_n) - \ell(\theta_0)) \overset{d}{\to} \chi^2(p)\]
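And a sketch of the corresponding LRT in the same illustrative exponential model, comparing \(2(\ell(\hat{\lambda}) - \ell(\lambda_0))\) to a \(\chi^2(1)\) reference distribution.

```python
import numpy as np
from scipy import stats

def loglik(lam, x):
    # exponential log-likelihood: n log(lambda) - lambda * sum(x)
    return x.size * np.log(lam) - lam * np.sum(x)

rng = np.random.default_rng(8)
lam0, n = 1.0, 300
x = rng.exponential(1 / 1.2, n)                    # data generated with rate 1.2
lam_hat = 1 / np.mean(x)

T_lrt = 2 * (loglik(lam_hat, x) - loglik(lam0, x))
p_value = stats.chi2.sf(T_lrt, df=1)
print(T_lrt, p_value)
```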

References

Collett, David. 1994. Modelling Survival Data in Medical Research. Chapman & Hall.
Keener, Robert W. 2010. Theoretical Statistics. Springer Texts in Statistics. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-93839-4.
Lehmann, E. L., and George Casella. 1998. Theory of Point Estimation. 2nd ed. Springer Texts in Statistics. New York: Springer.
Resnick, Sidney. 2019. A Probability Path. Springer.