Missing data lecture 6
MLEs in repeated measure models
Last class we talked about MLEs for a simple repeated measure model:
\[ \begin{aligned} y_{i} \mid X_{i} \, & = X_{i} \beta + \epsilon_{i} \\ \epsilon_{i} & \sim \text{Normal}(0, \Sigma) \\ \epsilon_{i} & \indy \epsilon_{j} \forall i\neq j \end{aligned} \] Let \(y = (y_1^T, y_2^T, \dots, y_n^T)^T\) and let \(X = (X_1^T, X_2^T, \dots, X_n^T)^T\), and let \(\epsilon = (\epsilon_1^T, \epsilon_2^T, \dots, \epsilon_n^T)^T\). Then the model can be written: \[ \begin{aligned} y \mid X \, & = X \beta + \epsilon \\ \epsilon & \sim \text{Normal}(0, I_n \otimes \Sigma) \end{aligned} \]
We showed that if \(\Sigma\) were known, we could write the MLE for \(\beta\) as: \[ \hat{\beta} = \textstyle(\sum_i X_i^T \Sigma^{-1} X_i)^{-1} (\sum_i X_i^T \Sigma^{-1} y_i) \] We can also show that the MLE for \(\Sigma\) if \(\beta\) were known is:
\[ \hat{\Sigma} = \frac{1}{n} \sum_i (y_i - X_i \beta) (y_i - X_i \beta)^T \] Combining these two facts, we can iteratively compute the MLE by doing:
\[ \begin{aligned} \Sigma^{(t+1)} & = \frac{1}{n} \sum_i (y_i - X_i \beta^{(t)}) (y_i - X_i \beta^{(t)})^T \\ \beta^{(t+1)} & = \textstyle(\sum_i X_i^T (\Sigma^{(t)})^{-1} X_i)^{-1} (\sum_i X_i^T (\Sigma^{(t)})^{-1} y_i) \end{aligned} \] which is similar to what the book has.
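Here is a minimal R sketch of this alternating scheme on simulated data (all of the data-generating values below are made up for illustration, and each \(\beta\) update uses the freshly computed \(\Sigma\), a standard zig-zag variant):

```r
set.seed(1)
n <- 200; m <- 3; p <- 2                      # subjects, obs. per subject, covariates
beta_true  <- c(1, -0.5)
Sigma_true <- 0.5 * diag(m) + 0.5             # exchangeable covariance
X <- lapply(1:n, function(i) cbind(1, rnorm(m)))          # each X_i is m x p
U <- chol(Sigma_true)                                     # t(U) %*% U = Sigma_true
y <- lapply(1:n, function(i) X[[i]] %*% beta_true + t(U) %*% rnorm(m))

beta <- rep(0, p)                             # beta^(0)
for (t in 1:50) {
  # Sigma update: average outer product of residuals at the current beta
  R     <- lapply(1:n, function(i) y[[i]] - X[[i]] %*% beta)
  Sigma <- Reduce(`+`, lapply(R, function(r) r %*% t(r))) / n
  # beta update: generalized least squares at the current Sigma
  Sinv <- solve(Sigma)
  A <- Reduce(`+`, lapply(X, function(x) t(x) %*% Sinv %*% x))
  b <- Reduce(`+`, lapply(1:n, function(i) t(X[[i]]) %*% Sinv %*% y[[i]]))
  beta <- solve(A, b)
}
beta; Sigma    # compare to beta_true and Sigma_true
```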
Inference for MLEs
We’ve shown that we can find a point in parameter space \(\hat{\theta}\) such that there is evidence against any other point \(\theta \neq \hat{\theta}\) being the parameter that generated the data under an assumed statistical model \(f_{Y}(y_i \mid \theta)\).
How do we assess the uncertainty in this point estimate? One way to think about this is to consider the distribution of MLEs under different hypothetical datasets.
If we can characterize the distribution, we can build a confidence interval for \(\theta\) that will contain the true value of \(\theta\) for some prescribed proportion of our hypothetical datasets.
Consider the model \(y_i \overset{\text{iid}}{\sim} \text{Normal}(\mu, \sigma^2)\), where \(\sigma^2\) is known (the one-dimensional normal model with known variance). Then we know that \(\hat{\mu} = \bar{y}\), and that its distribution is \(\bar{y} \sim \text{Normal}(\mu, \frac{\sigma^2}{n})\). Using this distribution, we can construct a statistic whose distribution doesn’t depend on any unknown parameters. This is also known as a pivotal quantity. \[ \sqrt{n}(\bar{y} - \mu)/\sigma \sim \text{Normal}(0, 1) \] We know the cumulative distribution function for \(N(0,1)\), \(\Phi(x)\), and we can use this distribution to build a confidence interval for \(\mu\): \[ P(z_{\alpha/2} < \sqrt{n}(\bar{y} - \mu)/\sigma < z_{1-\alpha/2}) \] where \(z_{p} = \Phi^{-1}(p)\) and \(\alpha\) is usually \(0.05\). Now we solve the system of inequalities for \(\mu\) to get our \(1 - \alpha\) confidence interval:
\[ \begin{aligned} P(z_{\alpha/2} < \sqrt{n}(\bar{y} - \mu)/\sigma < z_{1-\alpha/2}) & = P(\frac{\sigma}{\sqrt{n}} z_{\alpha/2} < \bar{y} - \mu < \frac{\sigma}{\sqrt{n}}z_{1-\alpha/2}) \\ & = P(\frac{\sigma}{\sqrt{n}} z_{\alpha/2} - \bar{y} < -\mu < -\bar{y} + \frac{\sigma}{\sqrt{n}}z_{1-\alpha/2}) \\ & = P(\bar{y} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2} > \mu > \bar{y} - \frac{\sigma}{\sqrt{n}}z_{1-\alpha/2}) \\ \end{aligned} \] Because the distribution is symmetric, \(z_{\alpha/2} = -z_{1-\alpha/2}\), which gives us: \[ P(\bar{y} - \frac{\sigma}{\sqrt{n}} z_{1-\alpha/2} < \mu < \bar{y} + \frac{\sigma}{\sqrt{n}}z_{1-\alpha/2}) = 1 - \alpha \] Then the interval \((\bar{y} - \frac{\sigma}{\sqrt{n}} z_{1-\alpha/2}, \bar{y} + \frac{\sigma}{\sqrt{n}}z_{1-\alpha/2})\) will contain \(\mu\) in a proportion \(1 - \alpha\) (here \(0.95\)) of datasets generated under the assumed \(\text{Normal}(\mu, \sigma^2)\) model.
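As a quick sanity check, this R snippet (with arbitrary choices of \(\mu\), \(\sigma\), and \(n\)) simulates many datasets and estimates the coverage of this interval:

```r
set.seed(2)
mu <- 3; sigma <- 2; n <- 25; alpha <- 0.05
z <- qnorm(1 - alpha / 2)
covered <- replicate(10000, {
  y  <- rnorm(n, mu, sigma)
  lo <- mean(y) - z * sigma / sqrt(n)   # lower endpoint of the interval
  hi <- mean(y) + z * sigma / sqrt(n)   # upper endpoint
  lo < mu && mu < hi
})
mean(covered)    # should be close to 1 - alpha = 0.95
```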
For all but the simplest models, the sampling distribution is intractable, but we can approximate the distribution using asymptotics.
We’ll need the multivariate central limit theorem (stated below as Theorem 1), which we’ll just take as a given.
We can use this idea to get a pivotal quantity that involves the MLE for \(\theta\) and the asymptotic variance covariance matrix of the MLE.
Let the log-likelihood be denoted \(\ell(\theta)\), in which we suppress the dependence of the likelihood on the data. If we need to indicate the dependence on data, we’ll write it as \(\ell(\theta; y)\). Finally, denote the gradient of the log-likelihood with respect to \(\theta\) evaluated at \(\theta^\prime\) as: \[ \ell_{\theta}(\theta^\prime; y) \equiv \nabla_\theta \log f_\theta(y)\mid_{\theta = \theta^\prime} \] and the Hessian of the log-likelihood (i.e. the matrix of second derivatives of the log-likelihood) be: \[ \ell_{\theta\theta}(\theta^\prime; y) \equiv \nabla^2_\theta \log f_\theta(y)\mid_{\theta = \theta^\prime} \] Given that \(\theta \in \R^p\), we’ll denote an element of the vector \(\ell_\theta(\theta^\prime)\) as \((\ell_\theta(\theta^\prime))_j\) and a row of \(\ell_{\theta\theta}(\theta^\prime; y)\) as \((\ell_{\theta\theta}(\theta^\prime; y))_j\).
We’ll start with some key assumptions:
1. The MLE is consistent for \(\theta\), which means that as we collect more samples the MLE will converge in probability to \(\theta\). This rules out the Neyman-Scott problem.
2. The \(y_i\) are iid with density \(f_Y(y_i \mid \theta)\), with \(\theta \in \Theta \subseteq \R^p\).
3. The support of the random variable \(y_i\) doesn’t depend on \(\theta\). In our earlier presentation of how things can go wrong with MLEs, having support that depends on the value of the parameter can make things go awry (like \(y_i \sim \text{Uniform}(0, \theta)\)), so we’ll assume that we’re not in that scenario.
4. The true parameter \(\theta^\dagger\) is in the interior of the parameter space. This ensures that the gradient of the log-likelihood evaluated at the maximizer \(\hat{\theta}\) will be zero.
5. Local uniform continuity in \(\theta\) of the second derivatives for all \(j\): \[ \sup_{\norm{\theta^\prime - \theta^\dagger} \leq r} \norm{(\ell_{\theta\theta}(\theta^\prime; x))_j - (\ell_{\theta\theta}(\theta^\dagger; x))_j} \leq H_r(x, \theta^\dagger),\, \lim_{r\to 0}\Exp{H_r(X, \theta^\dagger)} = 0 \]
The following proof is cobbled together from Schervish (2012) and Lehmann and Casella (1998).
The gradient of the log-likelihood evaluated at the MLE \(\hat{\theta}\) can be expanded around the true parameter value \(\theta^\dagger\): \[ (\ell_\theta(\hat{\theta}))_j = (\ell_\theta(\theta^\dagger))_j + (\ell_{\theta\theta}(\tilde{\theta}_j))_j(\hat{\theta} - \theta^\dagger) \] where \(\tilde{\theta}_j \in \R^p\) lies on the chord between \(\hat{\theta}\) and \(\theta^\dagger\) and may differ for each row \(j\) of \(\ell_\theta(\hat{\theta})\) that we’re expanding.
We multiply both sides by \(\frac{\sqrt{n}}{n}\):

\[ \frac{\sqrt{n}}{n}(\ell_\theta(\hat{\theta}))_j = \frac{\sqrt{n}}{n}(\ell_\theta(\theta^\dagger))_j + \frac{1}{n}(\ell_{\theta\theta}(\tilde{\theta}_j))_j\sqrt{n}(\hat{\theta} - \theta^\dagger) \]
We can show the following given assumption 5: \[ \frac{1}{n}(\ell_{\theta\theta}(\tilde{\theta}_j))_j \overset{p}{\to}\Exp{(\ell_{\theta\theta}(\theta^\dagger; Y))_j} \quad \forall j \] To see this, decompose: \[ \begin{aligned} \frac{1}{n}(\ell_{\theta\theta}(\tilde{\theta}_j))_j & = \frac{1}{n}(\ell_{\theta\theta}(\theta^\dagger))_j + \frac{1}{n}\lp(\ell_{\theta\theta}(\tilde{\theta}_j))_j - (\ell_{\theta\theta}(\theta^\dagger))_j\rp \end{aligned} \] In order to show that \(\Delta_{n,j} = \frac{1}{n}\lp(\ell_{\theta\theta}(\tilde{\theta}_j))_j - (\ell_{\theta\theta}(\theta^\dagger))_j\rp \overset{p}{\to} 0\), we need to show that \(P(\norm{\Delta_{n,j}} > \epsilon) \overset{n\to\infty}{\to} 0\).
\(\norm{\Delta_{n,j}} > \epsilon\) can only happen when \(\hat{\theta}_{n}\) is not in \(B_r(\theta^\dagger)\), the ball of radius \(r\) centered at \(\theta^\dagger\) (since \(\tilde{\theta}_j\) lies between \(\hat{\theta}_n\) and \(\theta^\dagger\)), or when \(\frac{1}{n} \sum_{i=1}^n H_r(x_i,\theta^\dagger) > \epsilon\). If we pick \(r\) small enough, assumption 5 lets us ensure \(\Exp{H_r(X, \theta^\dagger)} < \epsilon / 2\).
Then \[ \begin{aligned} \Prob{\norm{\Delta_{n,j}} > \epsilon}{\theta^\dagger} & \leq \Prob{\norm{\theta^\dagger - \hat{\theta}_n} > r \cup \frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta^\dagger) > \epsilon}{\theta^\dagger} \\ & \leq \Prob{\norm{\theta^\dagger - \hat{\theta}_n} > r}{\theta^\dagger} \\ & + \Prob{\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta^\dagger) > \epsilon}{\theta^\dagger} \\ & = \Prob{\norm{\theta^\dagger - \hat{\theta}_n} > r}{\theta^\dagger} \\ & + \Prob{\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta^\dagger) - \Exp{H_r(X_i, \theta^\dagger)} > \epsilon - \Exp{H_r(X_i, \theta^\dagger)}}{\theta^\dagger} \\ & \leq \Prob{\norm{\theta^\dagger - \hat{\theta}_n} > r}{\theta^\dagger} \\ & + \Prob{\abs{\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta^\dagger) - \Exp{H_r(X_i, \theta^\dagger)}} > \epsilon - \Exp{H_r(X_i, \theta^\dagger)}}{\theta^\dagger} \\ & \leq \Prob{\norm{\theta^\dagger - \hat{\theta}_n} > r}{\theta^\dagger} \\ & + \Prob{\abs{\frac{1}{n}\sum_{i=1}^n H_r(X_i, \theta^\dagger) - \Exp{H_r(X_i, \theta^\dagger)}} > \epsilon / 2}{\theta^\dagger} \\ \end{aligned} \] By the weak law of large numbers, the second term goes to zero, and the first term goes to zero by consistency (assumption 1). This leaves \(\frac{1}{n}(\ell_{\theta\theta}(\theta^\dagger))_j\), which converges in probability to \(\Exp{(\ell_{\theta\theta}(\theta^\dagger))_j}\) by the WLLN.
Because \(\hat{\theta}\) is an interior maximizer (assumption 4), \(\ell_\theta(\hat{\theta}) = 0\), so collecting our \(p\) equations into one set of equations yields: \[ \sqrt{n}\lp\frac{1}{n} \ell_\theta(\theta^\dagger)\rp = (-\Exp{\ell_{\theta\theta}(\theta^\dagger)} + o_p(1)) \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \tag{1}\]
Writing out the expressions as explicit sums: \[\begin{align*} \frac{\sqrt{n}}{n} \sum_{i=1}^n \ell_\theta(\theta^\dagger; x_i) = (-\Exp{\ell_{\theta\theta}(\theta^\dagger)} + o_p(1)) \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \end{align*} \tag{2}\]
The left-hand side of Equation 2 will be amenable to a multivariate version of the CLT. We’ll take the following multivariate CLT as given:
Theorem 1. Multivariate CLT, (Keener 2010) Let \(X_1, X_2, \dots\) be i.i.d random vectors in \(\R^k\) with a common mean \(\Exp{X_i} = \mu\) and common covariance matrix \(\Sigma = \Exp{(X_i - \mu)(X_i - \mu)^T}\). If \(\bar{X} = \frac{\sum_{i=1}^n X_i}{n}\), then \[\sqrt{n}(\bar{X} - \mu) \overset{d}{\to} \text{Normal}(0, \Sigma)\]
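A small R illustration of Theorem 1 (the bivariate distribution here is an arbitrary choice): we simulate \(\sqrt{n}(\bar{X} - \mu)\) many times and compare its sample covariance to \(\Sigma\).

```r
set.seed(3)
n <- 2000; reps <- 5000
mu    <- c(1, 2)                          # E(X) for X = (E, E + W), E, W ~ Exp(1)
Sigma <- matrix(c(1, 1, 1, 2), 2, 2)      # Cov(X), computed by hand
Z <- t(replicate(reps, {
  E <- rexp(n); W <- rexp(n)
  sqrt(n) * (c(mean(E), mean(E + W)) - mu)
}))
round(cov(Z), 2)   # should be close to Sigma
```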
Recall that the score has mean zero at the true parameter: \[ \Exp{\ell_\theta(\theta^\dagger; X_i)} = 0 \] By the multivariate central limit theorem (MCLT), the left-hand side of Equation 2 converges in distribution to a multivariate normal distribution with mean zero and covariance matrix \(\Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}\).
We’ll also need a lemma about the solutions to random linear equations:
Lemma 1 (Lemma 5.2 in (Lehmann and Casella 1998)) Suppose there is a set of \(p\) equations, \(j = 1, \dots, p\): \[\sum_{k=1}^p A_{jkn} Y_{kn} = T_{jn}.\] Let \(T_{1n}, \dots, T_{pn}\) converge in distribution to \(T_1, \dots, T_p\). Furthermore, suppose that for each \(j,k\), \(A_{jkn} \overset{p}{\to} a_{jk}\) such that the matrix \(A\) with \((j,k)^{\mathrm{th}}\) element \(a_{jk}\) is nonsingular. Then if the distribution of \(T_1, \dots, T_p\) is absolutely continuous with respect to the Lebesgue measure over \(\R^p\), \(Y_{1n}, \dots, Y_{pn}\) tend in probability to \(A^{-1} T\).
Written in matrix form and using the fact that convergence in probability implies convergence in distribution: \[ T_n = A_n Y_n \implies Y_n \overset{d}{\to} A^{-1}T \]
We have that the left-hand side of Equation 1 converges in distribution to a multivariate normal distribution, and that the matrix on the RHS of Equation 1 converges in probability to \(-\Exp{\ell_{\theta\theta}(\theta^\dagger)}\), which by assumption is invertible. Thus by Lemma 1, \(\sqrt{n}(\hat{\theta}_n - \theta^\dagger)\) converges in probability to \[\begin{align*} \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{p}{\to} (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1} \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}^{1/2}\mathcal{Z} \end{align*}\] where \(\mathcal{Z} \sim \text{Normal}(0, I_p)\), or \[\begin{align*} &\sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{d}{\to} \mathcal{N}\left(0, (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1} \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}(-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1}\right) \end{align*}\]
Assuming further that the information identity holds, \[ \Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)} + \Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T}=0 \implies (-\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)})^{-1}\Exp{\ell_\theta(\theta^\dagger;X_i)\ell_\theta(\theta^\dagger;X_i)^T} = I_p \] putting this all together shows that \[\begin{align*} \sqrt{n}(\hat{\theta}_n - \theta^\dagger) \overset{d}{\to} \mathcal{N}(0, \mathcal{I}(\theta^\dagger)^{-1}) \end{align*}\] where \(\mathcal{I}(\theta^\dagger) = -\Exp{\ell_{\theta\theta}(\theta^\dagger; X_i)}\) is the Fisher information.
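For a concrete check of this result in R, take the Bernoulli(\(\theta\)) model, where \(\hat{\theta}_n = \bar{y}\) and \(\mathcal{I}(\theta) = 1/(\theta(1-\theta))\); the constants below are arbitrary:

```r
set.seed(4)
theta <- 0.3; n <- 400
# many replications of sqrt(n) * (theta_hat - theta)
z <- replicate(5000, sqrt(n) * (mean(rbinom(n, 1, theta)) - theta))
var(z)                # should be near I(theta)^{-1}
theta * (1 - theta)   # = 0.21
```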
Then we can build an asymptotic confidence interval using the pivotal quantity strategy we had above:
\[ P(\bar{y} - \frac{\sigma}{\sqrt{n}} z_{1-\alpha/2} < \mu < \bar{y} + \frac{\sigma}{\sqrt{n}}z_{1-\alpha/2}) \] but with \(\hat{\theta}_1\) in place of \(\bar{y}\) and \(\sigma = \sqrt{\mathcal{I}(\hat{\theta})^{-1}_{1,1}}\). It is sometimes hard to calculate the Fisher information because it involves taking expectations of the negative Hessian of the log-likelihood function. If we’d prefer, we can instead use the estimator
\[
\hat{\mathcal{I}}(\hat{\theta}) = -\frac{1}{n} \sum_i \ell_{\theta\theta}(\hat{\theta}; y_i)
\]
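A hedged R sketch of this plug-in approach (the gamma model and data-generating values are arbitrary): fit the model numerically with `optim()`, and note that the returned Hessian of the negative log-likelihood is \(n\hat{\mathcal{I}}(\hat{\theta})\), so its inverse estimates the asymptotic variance of \(\hat{\theta}\).

```r
set.seed(5)
y <- rgamma(200, shape = 2, rate = 1.5)
negloglik <- function(par) -sum(dgamma(y, shape = par[1], rate = par[2], log = TRUE))
fit <- optim(c(1, 1), negloglik, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6), hessian = TRUE)
theta_hat <- fit$par
V  <- solve(fit$hessian)        # fit$hessian = n * I_hat(theta_hat)
se <- sqrt(diag(V))
theta_hat[1] + c(-1, 1) * qnorm(0.975) * se[1]   # 95% Wald CI for the shape
```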
There are ways to build multivariate confidence intervals, but I won’t go over those right now, though they are covered in the book in Chapter 6.
Bayes
The machinery for Frequentist inference often relies on asymptotic arguments for complex models. Bayesian inference, on the other hand, does not, and gives exact finite-sample inference. There are caveats, though, which we’ll cover.
MLEs are concerned with finding a single point that, in some sense, agrees with the dataset at hand, though our inference depends on hypothetical replications of the experiment that would generate alternative datasets. Bayesian inference instead requires that we characterize the distribution of parameters that agree with the dataset at hand.
The idea of Bayesian inference starts with Bayes’ rule. Given a prior distribution \(p(\theta)\) we can combine that with the observational density of data \(f_Y(y \mid \theta)\) to give us an updated distribution \(p(\theta \mid y)\) given the dataset at hand: \[ p(\theta \mid y) = \frac{f_Y(y \mid \theta) p(\theta)}{\int_{\Omega_\theta}f_Y(y \mid \theta) p(\theta) d \theta} \] The nice thing about Bayesian inference is that we get a distribution over \(\theta\) given the dataset that we can now use to make probability statements about \(\theta\). The statement “with 0.95 probability, \(\theta\) is in the interval \(C\)” is just a manipulation of the posterior density: \(P(\theta \in C \mid y) = \int_C p(\theta \mid y) d\theta\).
Let’s look at a specific example: the standard \(y_i \overset{\text{iid}}{\sim} \text{Bernoulli}(\theta)\) model with \(\theta \sim \text{Beta}(\alpha, \beta)\).
The likelihood is:
\[ \prod_i \theta^{y_i}(1 - \theta)^{1 - y_i} = \theta^{\sum_i y_i}(1 - \theta)^{n - \sum_i y_i} \] The prior for \(\theta\) will be:
\[ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \] The numerator of the posterior is the product of these two expressions, where we let \(s=\sum_i y_i\) for convenience:
\[ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{s + \alpha - 1}(1 - \theta)^{n - s + \beta - 1} \] Integrating over \(\theta\) will give us the denominator of our expression:
\[ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha + s)\Gamma(\beta + n - s)}{\Gamma(\alpha + \beta + n)} \] The ratio of the numerator and the denominator gives us: \[ \frac{\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{s + \alpha - 1}(1 - \theta)^{n - s + \beta - 1}}{ \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha + s)\Gamma(\beta + n - s)}{\Gamma(\alpha + \beta + n)} } \] which simplifies to: \[ \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + s)\Gamma(\beta + n - s)}\theta^{s + \alpha - 1}(1 - \theta)^{n - s + \beta - 1} \] This is just the Beta distribution with updated parameters, \(\text{Beta}(\alpha + s, \beta + n - s)\).
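We can verify this algebra numerically in R (simulated data, arbitrary hyperparameters): the likelihood-times-prior, normalized on a grid, should match `dbeta` with the updated parameters.

```r
set.seed(6)
alpha <- 2; beta <- 3
y <- rbinom(50, 1, 0.3); n <- length(y); s <- sum(y)
theta  <- seq(0.001, 0.999, length.out = 1000)
unnorm <- theta^(s + alpha - 1) * (1 - theta)^(n - s + beta - 1)
post_grid  <- unnorm / (sum(unnorm) * diff(theta)[1])   # numerically normalized
post_exact <- dbeta(theta, alpha + s, beta + n - s)
max(abs(post_grid - post_exact))                        # small, up to grid error
```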
The prior mean is \(\frac{\alpha}{\alpha + \beta}\), while the posterior mean is \(\frac{s + \alpha}{n + \alpha + \beta}\). We can rewrite the latter to get a better understanding of what the posterior mean represents in this circumstance: \[
\begin{aligned}
\frac{s + \alpha}{n + \alpha + \beta} = \frac{\alpha}{\alpha + \beta}\frac{\alpha + \beta}{n + \alpha + \beta} + \lp 1 - \frac{\alpha + \beta}{n + \alpha + \beta}\rp \frac{s}{n}
\end{aligned}
\] This shows that the posterior mean is a weighted average of the prior mean and the data mean. This exemplifies what we would hope for from Bayesian inference: some adjudication between the prior and the data. The posterior distribution is a Beta distribution, so we can get probability statements easily by using `qbeta` in R.
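For instance, with hypothetical values \(\alpha = 2\), \(\beta = 3\), \(n = 50\), and \(s = 17\):

```r
alpha <- 2; beta <- 3; n <- 50; s <- 17
prior_mean <- alpha / (alpha + beta)
post_mean  <- (s + alpha) / (n + alpha + beta)
w <- (alpha + beta) / (n + alpha + beta)          # weight on the prior mean
c(post_mean, w * prior_mean + (1 - w) * s / n)    # identical, by the identity above
qbeta(c(0.025, 0.975), alpha + s, beta + n - s)   # central 95% posterior interval
```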
We didn’t have to go through all of the marginalization above. We could have noticed that the kernel of the posterior, namely the expression that depends on the unknown parameters, had a familiar form: \[ p(\theta \mid s) \propto \theta^{s + \alpha - 1}(1 - \theta)^{n - s + \beta - 1} \] This is the kernel of the beta distribution, so we could have just stopped here and said that \[ p(\theta \mid s) \equiv \text{Beta}(\alpha + s, \beta + n - s) \] This procedure is aided by conjugate priors, which match the likelihood in a way; the functional form of the prior slots into the way the parameters are expressed in the likelihood to yield a posterior in the same family as the prior.
These probabilities are strictly “right” under our prior assumption because of the math of Bayes’ theorem. Whether or not these are useful is another question.
There are some downsides to using a prior. MLEs have a nice property called invariance, namely that if \(\hat{\theta}\) is the MLE then the MLE for a function of \(\theta\), say \(g(\theta)\), is just \(g(\hat{\theta})\). This isn’t true in Bayesian inference generally, because we’re now dealing with distributions rather than points. So usually the posterior for \(\theta\) won’t be the same as the posterior for \(g(\theta)\): Let \(\eta = g(\theta)\), and assume for simplicity’s sake that \(g\) is one-to-one. Then \(\theta = g^{-1}(\eta)\). If \(\theta\) has posterior \(p(\theta \mid y)\), the posterior for \(g(\theta)\) is:
\[ p(g^{-1}(\eta) \mid y) \abs{\det \nabla_{\eta} g^{-1}(\eta)} \] In fact, this is true of priors too! Given \(p(\theta)\) and a transformation \(\eta = g(\theta)\), the prior for \(\eta\) is: \[ p(\eta) = p(g^{-1}(\eta)) \abs{\det \nabla_{\eta} g^{-1}(\eta)} \] This is somewhat problematic if you think about how to represent ignorance. Let’s say you want to learn about a parameter \(\theta \in [0,1]\). The simplest prior for this is the flat prior \(p(\theta) \propto 1\), which implies that \(\theta\) is equally likely to be anywhere in the interval \([0,1]\). But what does that imply about \(\eta = \theta^2\)? On \([0,1]\), \(g(\theta) = \theta^2\) is one-to-one, with inverse \(g^{-1}(\eta) = \sqrt{\eta}\). The derivative of \(\sqrt{\eta}\) is proportional to \(\eta^{-1/2}\), which implies a downward-sloping prior density on \([0,1]\). But why would you have knowledge of \(\theta^2\) without knowledge of \(\theta\)? It seems counterintuitive.
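A short R simulation makes the point (purely illustrative): draw \(\theta\) from the flat prior and compare the histogram of \(\eta = \theta^2\) to the change-of-variables density \(\frac{1}{2\sqrt{\eta}}\).

```r
set.seed(8)
theta <- runif(1e5)        # flat prior on [0, 1]
eta   <- theta^2           # implied prior draws for eta = theta^2
hist(eta, breaks = 50, freq = FALSE, main = "Implied prior on eta")
curve(1 / (2 * sqrt(x)), from = 0.01, to = 1, add = TRUE, lwd = 2)
```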
The dyed-in-the-wool Bayesian would argue that there is no such thing as true ignorance, and that the problem that you face will have consequences for where you expect your parameter to lie. Suppose you’re modeling the proportion of Corvallis residents with synovial sarcoma, which is a very rare cancer. You’re probably going to use a prior that favors small values of \(\theta\).
What if instead you’re looking at the proportion of rainy days in Corvallis from November to April? You’d probably use a prior that at the very least avoided \(\theta\) near 0.