Missing data lecture 4
Likelihood-based inference
This bit of the notes follows 6.1 pretty closely.
For the moment, let’s forget about missing data and move to the simpler case where we have no missing observations and we’re just focused on learning the unknown parameters \(\theta\) in the density \(f_Y(y_i \mid \theta)\). This unknown parameter is going to live in the space \(\Omega_\theta\). For example, if \(\theta = (\mu, \sigma^2)\) and \(f_Y(y_i \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2} (y_i - \mu)^2}\), then \(\Omega_\theta\) would be \(\R \times \R^+\).
We know densities are functions of \(y_i\) for fixed values of \(\theta\). If we instead fix \(y_i\) and let \(\theta\) vary, we get a likelihood function.
Let the likelihood be defined for a fixed \(y_i\) as \[ L_Y(\theta \mid y_i) \propto \begin{cases} f(y_i \mid \theta) & \theta \in \Omega_\theta \\ 0 & \theta \not\in \Omega_\theta \end{cases} \] This is a bit odd at first glance: it means that any function of \(\theta\) proportional to the density of \(Y\) counts as the likelihood, so long as the factor by which \(L_Y(\theta \mid y_i)\) differs from \(f(y_i \mid \theta)\) is constant in \(\theta\) (it may depend on \(y_i\)).
Typically, we’ll have more than one observation, and under independence of \(y_i\) we get the likelihood for the full sample: \[ L_Y(\theta \mid y_1, \dots, y_n) \propto \begin{cases} \prod_{i=1}^n f(y_i \mid \theta) & \theta \in \Omega_\theta \\ 0 & \theta \not\in \Omega_\theta \end{cases} \] We’ll often work with the log-likelihood, which is denoted in our book as: \[ \ell_Y(\theta \mid y_1, \dots, y_n) = \log(L_Y(\theta \mid y_1, \dots, y_n)) \] When we have independence, we get
\[ \ell_Y(\theta \mid y_1, \dots, y_n) = \sum_{i=1}^n \log f(y_i \mid \theta) + C \] where \(C\) is any term that doesn’t depend on \(\theta\).
Moving forward with the normal example from above, the likelihood of \(n\) observations from the normal distribution is: \[ \begin{aligned} L_Y(\mu, \sigma^2 \mid y_1, \dots, y_n) & = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2\sigma^2} (y_i - \mu)^2} \\ & = (2\pi \sigma^2)^{-\frac{n}{2}} \exp \lp -\frac{1}{2\sigma^2} \textstyle\sum_i(y_i - \mu)^2\rp \end{aligned} \] Our definition of likelihood means that we can drop the factor of \((2\pi)^{-\frac{n}{2}}\) from the front of the expression. Taking logs and dropping that constant leads to the log-likelihood:
\[ \ell_Y(\mu, \sigma^2) = -\frac{n}{2} \log \sigma^2 -\frac{1}{2\sigma^2} \textstyle\sum_i(y_i - \mu)^2 \] which is a bivariate function of \(\mu\) and \(\sigma^2\).
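To make this concrete, here is a minimal Python sketch (my own addition, not part of the notes) that evaluates this bivariate log-likelihood on a small grid of \((\mu, \sigma^2)\) values; the simulated data and the grid ranges are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=100)  # simulated data: mu = 1, sigma^2 = 4

def loglik(mu, sigma2, y):
    # the normal log-likelihood above, with the constant term dropped
    return -0.5 * len(y) * np.log(sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

mus = np.linspace(0.0, 2.0, 5)
sigma2s = np.linspace(2.0, 6.0, 5)
surface = np.array([[loglik(m, s2, y) for s2 in sigma2s] for m in mus])
print(surface.round(1))  # a 5 x 5 slice of the log-likelihood surface
```

The surface should be largest near the true \((\mu, \sigma^2)\), which is exactly the idea we exploit next.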
The way we’ll use the likelihood rests on the following intuition: for two parameter values \(\theta^\prime\) and \(\theta^{\prime\prime}\), if \(L(\theta^\prime \mid y) = 2L(\theta^{\prime\prime} \mid y)\), then there is evidence against \(\theta^{\prime\prime}\), relative to \(\theta^\prime\), being the parameter that generated the dataset.
If we were to find a \(\hat{\theta}\) such that \(L_Y(\hat{\theta} \mid y) \geq L_Y(\theta^* \mid y)\) for all \(\theta^* \neq \hat{\theta}\), then this would be evidence against \(\theta\) being anything other than \(\hat{\theta}\). This is the intuition behind maximum likelihood: finding the value of the unknown parameters \(\theta\) that maximizes the likelihood function (or, equivalently, the log-likelihood).
We’ll say that the maximum likelihood estimator (MLE) of \(\theta\) is the value \(\hat{\theta} \in \Omega_\theta\) that maximizes the log-likelihood.
In order to maximize the likelihood, we need to find a point at which the gradient of the log-likelihood, also known as the score function, is zero: \(\frac{\partial\ell_Y(\theta \mid y)}{\partial \theta}\Big|_{\theta = \hat{\theta}} = 0\). If \(\theta\) is \(d\)-dimensional, then there are \(d\) equations that need to be solved. In addition, the Hessian of the log-likelihood needs to be checked to see if it is negative semidefinite at the candidate MLE: \[ z^T \frac{\partial^2 \ell_Y(\theta \mid y)}{\partial \theta \,\partial \theta^T}\bigg|_{\theta = \hat{\theta}} z \leq 0 \,\forall\, z \in \R^d \] This ensures that we’ve found a maximum. Note that there can be several maximizers, all of which lead to the same (maximal) log-likelihood value.
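In practice we often maximize the log-likelihood numerically rather than solving the score equations by hand. Here is a minimal sketch for the normal example, assuming NumPy/SciPy and simulated data with arbitrarily chosen true values; optimizing over \(\log \sigma^2\) keeps the variance positive, and the default quasi-Newton method takes care of the curvature for us in practice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)  # simulated data: mu = 2, sigma^2 = 2.25

def neg_loglik(theta, y):
    # negative normal log-likelihood, parameterized by (mu, log sigma^2)
    mu, log_sigma2 = theta
    sigma2 = np.exp(log_sigma2)
    return 0.5 * len(y) * np.log(sigma2) + np.sum((y - mu) ** 2) / (2 * sigma2)

fit = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(y,))
mu_hat, sigma2_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma2_hat)  # should land near the true values mu = 2, sigma^2 = 2.25
```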
Simple MLE example
Let’s say that \(y_i \overset{\text{iid}}{\sim} \text{Exp}(\lambda)\), parameterized by its mean so that \(f(y_i \mid \lambda) = \frac{1}{\lambda} e^{-y_i/\lambda}\). Then the likelihood will be \[ L(\lambda \mid y_1, \dots, y_n) = \lambda^{-n} e^{-\frac{1}{\lambda}\sum_i y_i} \] The score equation is then:
\[ -\frac{n}{\lambda} + \sum_i \frac{y_i}{\lambda^2} = 0 \] This leads to \(\hat{\lambda} = \bar{y}\), the sample mean.
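As a quick sanity check (my own addition), we can simulate exponential data and confirm numerically that the maximizer of this likelihood is the sample mean. Note that NumPy’s exponential generator takes the mean as its `scale` argument, matching the mean-\(\lambda\) parameterization used above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
lam_true = 3.0
y = rng.exponential(scale=lam_true, size=500)  # scale = mean = lambda

def neg_loglik(lam, y):
    # negative log-likelihood for the mean-parameterized exponential
    return len(y) * np.log(lam) + np.sum(y) / lam

fit = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), args=(y,), method="bounded")
print(fit.x, np.mean(y))  # the numerical maximizer matches ybar
```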
More involved MLE example: Normal with unknown mean and variance
Suppose \(y_i \overset{\text{iid}}{\sim} \text{Normal}(\mu, \sigma^2)\) for \(i = 1, \dots, n\).
We wrote the log-likelihood above as: \[ \ell_Y(\mu, \sigma^2) = -\frac{n}{2} \log \sigma^2 -\frac{1}{2\sigma^2} \textstyle\sum_i(y_i - \mu)^2 \]
This leads to two likelihood equations:
\[ \begin{aligned} \frac{\partial \ell}{\partial \mu } & = \frac{1}{\sigma^2} \textstyle\sum_i(y_i - \mu) \\ \frac{\partial \ell}{\partial \sigma^2 } & =-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \textstyle\sum_i(y_i - \mu)^2 \end{aligned} \] Setting these to zero and solving for \((\mu, \sigma^2)\) gives the MLEs \(\hat{\mu} = \bar{y}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2\).
It’s straightforward to show that \(\Exp{\hat{\mu}}\) under the data generating process above is equal to \(\mu\). It is less straightforward, and somewhat distressing, that \(\Exp{\hat{\sigma}^2}\) is not equal to \(\sigma^2\):
\[ \begin{aligned} \Exp{\hat{\sigma}^2} & = \frac{1}{n}\Exp{\textstyle\sum_i (y_i^2 - 2 y_i \bar{y} + \bar{y}^2)} \\ & = \frac{1}{n} \Exp{\textstyle\sum_i y_i^2 - 2 n \bar{y}^2 + n\bar{y}^2}\\ & = \frac{1}{n}\textstyle\sum_i \Exp{y_i^2} - \Exp{\bar{y}^2} \\ & = \mu^2 + \sigma^2 - \frac{1}{n^2}\Exp{\textstyle \sum_i y_i^2 + 2 \sum_{i<j} y_i y_j} \\ & = \mu^2 + \sigma^2 - \frac{1}{n^2}\lp \textstyle \sum_i \Exp{y_i^2} + 2 \sum_{i<j} \Exp{y_i y_j}\rp \\ & = \mu^2 + \sigma^2 - \frac{1}{n^2} \lp n (\mu^2 + \sigma^2) + n(n-1) \mu^2\rp \\ & = \mu^2 + \sigma^2 - \mu^2 - \frac{1}{n}\sigma^2 \\ & = \frac{n-1}{n} \sigma^2 \end{aligned} \] It’s a bit odd that MLEs can give us biased estimates, but, you might say, as \(n\to\infty\) all is fine and we recover \(\sigma^2\). Indeed, you can show that the MLE for \(\sigma^2\) is consistent, which means that \(\hat{\sigma}^2 \to \sigma^2\) in probability as \(n\to\infty\).
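A small simulation sketch (mine, with arbitrary choices of \(n\), \(\sigma^2\), and the number of replications) makes the bias visible at small \(n\): the average of \(\hat{\sigma}^2\) across replications sits near \(\frac{n-1}{n}\sigma^2\) rather than \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 5, 4.0, 100_000

# MLE of sigma^2 (divide by n, not n - 1) for each simulated dataset
y = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
sigma2_hat = y.var(axis=1, ddof=0)

print(sigma2_hat.mean())     # roughly (n - 1) / n * sigma2
print((n - 1) / n * sigma2)  # 3.2
```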
More complicated still: Neyman-Scott problem
Is this always the case, that the MLE is consistent? The answer is no, and the MLE can fail pretty spectacularly. Consider the following problem:
\[ y_{i1}, y_{i2} \sim \text{Normal}(\mu_i, \sigma^2), \quad \text{independently for } i = 1, \dots, n \] Let’s say that we don’t care about inferring \(\mu_i\); we care only about \(\sigma^2\). One way we could come up with an estimator is to difference the two observations for each group, so that \(z_i = y_{i1} - y_{i2}\) and: \[ z_i \sim \text{Normal}(0, 2\sigma^2) \] Then we could reuse our work from above to solve for \(\sigma^2\):
\[ \widehat{2\sigma^2} = \frac{1}{n} \textstyle \sum_i z_i^2 \] which leads to an estimator for \(\sigma^2\) of: \[ \hat{\sigma}^2 = \frac{1}{2n} \textstyle \sum_i z_i^2 \]
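As a quick check (a sketch of my own, with arbitrary values for \(n\), \(\sigma^2\), and the \(\mu_i\)), this difference-based estimator does recover \(\sigma^2\) in large samples:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 10_000, 2.0
mu = 5.0 * rng.normal(size=n)  # arbitrary group means mu_i

y1 = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
y2 = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
z = y1 - y2                    # z_i ~ Normal(0, 2 sigma^2)

sigma2_diff = np.sum(z ** 2) / (2 * n)  # difference-based estimator
print(sigma2_diff)                      # close to sigma2 = 2.0
```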
But this isn’t quite maximum likelihood for the original data: we’ve transformed to the \(z_i\) and then derived an MLE for \(\sigma^2\) from the transformed data. What if we just use \(y_{i1}, y_{i2}\) directly?
The likelihood is straightforward:
\[ L_Y(\{\mu_i\}, \sigma^2 \mid y) = \prod_{i=1}^n \frac{1}{2 \pi \sigma^2} \exp \lp -\frac{1}{2\sigma^2} \lp (y_{i1} - \mu_i)^2 + (y_{i2} - \mu_i)^2 \rp \rp \]
The likelihood equation for \(\mu_i\) is \[
\frac{\partial \ell_Y(\{\mu_i\}, \sigma^2 \mid y)}{\partial \mu_i} = \frac{1}{\sigma^2}\lp (y_{i1} - \mu_i) + (y_{i2} - \mu_i) \rp = 0,
\] which just leads to the estimator \(\hat{\mu}_i = \frac{y_{i1} + y_{i2}}{2} = \bar{y}_i\).
Let’s write out the likelihood equations for \(\sigma^2\) after plugging in our \(\hat{\mu}_i = \bar{y}_i\): \[
\frac{\partial \ell_Y(\{\mu_i\}, \sigma^2 \mid y)}{\partial \sigma^2} = -\frac{n}{\sigma^2} + \frac{1}{2 \sigma^4}\textstyle\sum_i \sum_{j=1}^2 (y_{ij} - \bar{y}_i)^2
\] This leads to an MLE of
\[ \hat{\sigma}^2 = \frac{1}{2n}\textstyle\sum_i \sum_{j=1}^2 (y_{ij} - \bar{y}_i)^2 \] This seems reasonable enough. But let’s write the inner sum in terms of \(z_i\): \[ \begin{aligned} (y_{i1} - \bar{y}_i)^2 + (y_{i2} - \bar{y}_i)^2 & = y_{i1}^2 + y_{i2}^2 + 2\bar{y}_i^2 - 2 \bar{y}_i(y_{i1} + y_{i2}) \\ & = y_{i1}^2 + y_{i2}^2 + \frac{(y_{i1} + y_{i2})^2}{2} - (y_{i1} + y_{i2})^2 \\ & = y_{i1}^2 + y_{i2}^2 - \frac{(y_{i1} + y_{i2})^2}{2} \\ & = \frac{1}{2}(y_{i1}^2 + y_{i2}^2 - 2 y_{i1}y_{i2}) \\ & = \frac{1}{2} z_i^2 \end{aligned} \]
This means that the MLE is equal to \[ \hat{\sigma}^2 = \frac{1}{4n}\textstyle\sum_i z_i^2 \] Since \(\Exp{z_i^2} = 2\sigma^2\), this estimator converges to \(\sigma^2/2\), not \(\sigma^2\), as \(n\to\infty\): no amount of data fixes it.
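A companion sketch to the one above (again mine, with arbitrary settings) shows the failure directly: as \(n\) grows, the full MLE settles near \(\sigma^2/2\) instead of \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 2.0

for n in (100, 10_000, 1_000_000):
    mu = 5.0 * rng.normal(size=n)             # one nuisance mean per pair
    y = mu[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(n, 2))
    ybar = y.mean(axis=1, keepdims=True)
    mle = np.sum((y - ybar) ** 2) / (2 * n)   # full MLE of sigma^2
    print(n, mle)                             # approaches sigma2 / 2 = 1.0
```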
This failure is due to the fact that the dimension of the parameter space grows linearly with the sample size: each new pair of observations brings a new nuisance parameter \(\mu_i\), so we never accumulate information about any individual \(\mu_i\).
The point is that MLEs (and any other sort of likelihood-based inference) can lead you astray if you’re not careful, and they are not a panacea.