Missing data lecture 11: More EM
One more EM convergence result
Another result is that if:
\(\theta^{t+1}\) is chosen such that \(\frac{\partial}{\partial \theta} Q(\theta \mid \theta^t) \mid_{\theta = \theta^{t+1}} = 0\),
\(\theta^{t+1} \to \theta^\star\), and
\(f(Y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta)\) is sufficiently smooth in \(\theta\),
Then
\(\frac{\partial}{\partial \theta} \ell(\theta \mid Y_{(0)} = \tilde{y}_{(0)}) \mid_{\theta = \theta^\star} = 0\)
which means that the value to which the step-by-step maximization of the \(Q(\theta \mid \theta^t)\) functions converges, \(\theta^\star\), will also be a stationary point of the observed data likelihood, \(\ell(\theta \mid Y_{(0)} = \tilde{y}_{(0)})\)
\[ \begin{aligned} \frac{\partial}{\partial \theta} \ell(\theta \mid Y_{(0)} = \tilde{y}_{(0)}) \mid_{\theta = \theta^\star} = \frac{\partial}{\partial \theta} Q(\theta \mid \theta^{t}) \mid_{\theta = \theta^\star} - \frac{\partial}{\partial \theta} H(\theta \mid \theta^{t}) \mid_{\theta = \theta^\star} \end{aligned} \] The first term on the RHS is zero: by construction, \(\theta^{t+1}\) satisfies \(\frac{\partial}{\partial \theta} Q(\theta \mid \theta^{t}) \mid_{\theta = \theta^{t+1}}=0\) at every iteration, so if \(\theta^t \to \theta^\star\) and the gradient is continuous, we must have \(\frac{\partial}{\partial \theta} Q(\theta \mid \theta^{t}) \mid_{\theta = \theta^\star}=0\) in the limit. The second term on the RHS is \[ \begin{aligned} \frac{\partial}{\partial \theta} H(\theta \mid \theta^{t}) \mid_{\theta = \theta^\star} & = \frac{\partial}{\partial \theta} \int_{\mathcal{Y}_{(1)}} \log f(Y_{(1)} = y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta) f(Y_{(1)} = y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta^t) dy_{(1)} \mid_{\theta=\theta^\star} \\ & = \int_{\mathcal{Y}_{(1)}} \frac{\frac{\partial}{\partial \theta}f(Y_{(1)} = y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta)\mid_{\theta=\theta^\star}}{f(Y_{(1)} = y_{(1)}\mid Y_{(0)} = \tilde{y}_{(0)}, \theta^\star)} f(Y_{(1)} = y_{(1)}\mid Y_{(0)} = \tilde{y}_{(0)}, \theta^t) dy_{(1)} \end{aligned} \] As \(\theta^t \to \theta^\star\), the weighting density \(f(Y_{(1)} = y_{(1)}\mid Y_{(0)} = \tilde{y}_{(0)}, \theta^t)\) converges to the denominator and the two cancel, leaving
\[ \int_{\mathcal{Y}_{(1)}} \frac{\partial}{\partial \theta}f(Y_{(1)} = y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta)\mid_{\theta=\theta^\star}dy_{(1)} \] which, if we pull the derivative outside the integral again, equals zero: the density integrates to 1 for every \(\theta\), so we’re differentiating a constant.
Nontrivial EM example
Here’s a nontrivial example: Let’s say we have three outcomes, so \(Y\) is an \(n \times 3\) matrix, and the \(Y_i\) are distributed trivariate normal with means \(\mu_1, \mu_2, \mu_3\) and covariance matrix \(\Sigma\). For simplicity’s sake, we assume the missingness is ignorable, and assume we have only three missingness patterns: \[ m_i \in \{(0, 0, 1), (1, 0, 0), (0, 0, 0)\}. \] The parameters of interest are \(\mu_1, \mu_2, \mu_3, \Sigma\), and we’d like to use all the available data to do inference. Let the sizes of the three groups be: \[ \begin{aligned} r_1 & = \# (y_{1i} \text{ obs., } y_{2i} \text{ obs., } y_{3i} \text{ miss. })\\ r_2 & = \# (y_{1i} \text{ miss., } y_{2i} \text{ obs., } y_{3i} \text{ obs. }) \\ n - r_1 - r_2 & = \# (y_{1i} \text{ obs., } y_{2i} \text{ obs., } y_{3i} \text{ obs. }) \end{aligned} \] and suppose we have arranged our indices \(i\) so that \(i \in \{1, \dots, r_1\}\) have only \(y_3\) missing, \(i \in \{r_1 + 1, \dots, r_1 + r_2\}\) have only \(y_1\) missing, and \(i \in \{r_1+r_2+1, \dots, n\}\) have all data observed. Let \(f_{Y_j, Y_k}((y_{ji},y_{ki}) \mid (\mu_j, \mu_k), \Sigma_{jk})\) be the bivariate normal density for indices \(j \neq k\), where \[\Sigma_{jk} = \Exp{((Y_j, Y_k)^T - (\mu_j, \mu_k)^T)((Y_j, Y_k) - (\mu_j, \mu_k))}\] while \(f_{Y}(y_{i} \mid \mu_1, \mu_2, \mu_3, \Sigma)\) is the trivariate normal density. The observed data likelihood is \[ \begin{aligned} L(\mu_1, \mu_2, \mu_3, \Sigma \mid Y_{(0)}) & = \prod_{i=1}^{r_1} f_{Y_1, Y_2}(y_{1i}, y_{2i} \mid \mu_1,\mu_2, \Sigma_{12}) \\ &\prod_{i=r_1+1}^{r_1+r_2} f_{Y_2,Y_3}(y_{2i}, y_{3i} \mid \mu_2, \mu_3, \Sigma_{23}) \\ &\prod_{i=r_1+r_2+1}^{n} f_{Y}(y_{i} \mid \mu_1, \mu_2, \mu_3, \Sigma) \end{aligned} \] This is going to be a hard function to maximize in terms of \(\Sigma\).
We can find the MLEs for this expression, but it isn’t standard, and it’ll involve some thinking, whereas if we had a complete dataset we could just do the maximization very easily.
If we had a complete dataset, we know from an earlier lecture the ML solutions for these quantities: \[ \hat{\mu}_1 = \bar{y}_1,\,\hat{\mu}_2 = \bar{y}_2,\,\hat{\mu}_3 = \bar{y}_3,\, \hat{\Sigma} = 1/n\sum_i y_i y_i^T - \hat{\mu}\hat{\mu}^T \]
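The complete-data MLEs above are easy to compute directly; here is a minimal numpy sketch (the function name is mine):

```python
import numpy as np

def complete_data_mle(Y):
    """Complete-data MLEs for a multivariate normal: the sample mean and
    the (1/n)-denominator covariance, matching the formulas above."""
    n = Y.shape[0]
    mu_hat = Y.mean(axis=0)
    # (1/n) * sum_i y_i y_i^T  -  mu_hat mu_hat^T
    Sigma_hat = (Y.T @ Y) / n - np.outer(mu_hat, mu_hat)
    return mu_hat, Sigma_hat
```

Note the \(1/n\) (not \(1/(n-1)\)) denominator: the MLE of \(\Sigma\) is biased in finite samples.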
How can we get an MLE when there is ignorable missingness as described above?
Multivariate normal with missingness
We can run the EM algorithm, which will require computing the \(Q(\theta \mid \theta^t)\) function. For the multivariate normal example, this is (up to an additive constant): \[ \Exp{\ell_{Y}(\mu, \Sigma \mid Y) \mid Y_{(0)}, \mu^t, \Sigma^t} = \frac{n}{2} \log (\det\Sigma^{-1}) -\frac{1}{2}\text{tr}\lp\sum_i\Exp{(y_i - \mu) (y_i - \mu)^T \mid Y_{(0)}, \mu^t, \Sigma^t}\Sigma^{-1}\rp \] If we expand out the cross product, we see we need \[ \Exp{(y_i - \mu) (y_i - \mu)^T \mid Y_{(0)}, \mu^t, \Sigma^t} = \Exp{y_i y_i^T \mid Y_{(0)}, \mu^t, \Sigma^t} - \mu\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}^T - \Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}\mu^T + \mu\mu^T \] The first term is: \[ \Exp{y_i y_i^T \mid Y_{(0)}, \mu^t, \Sigma^t} = \text{Cov}(y_i \mid Y_{(0)}, \mu^t, \Sigma^t) + \Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}^T \] Plugging this back in above gives:
\[ \text{Cov}(y_i \mid Y_{(0)}, \mu^t, \Sigma^t) + \Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}^T - \mu\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}^T - \Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}\mu^T + \mu\mu^T \] which, completing the square, simplifies to \[ \text{Cov}(y_i \mid Y_{(0)}, \mu^t, \Sigma^t) + (\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t} - \mu )(\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t} - \mu )^T \] Plugging this back in above, we get the expected log-likelihood \[ \begin{aligned} & \Exp{\ell_{Y}(\mu, \Sigma \mid Y) \mid Y_{(0)}, \mu^t, \Sigma^t} = \frac{n}{2} \log (\det\Sigma^{-1}) \\ & -\frac{1}{2}\text{tr}\sum_i\lp\text{Cov}(y_i \mid Y_{(0)}, \mu^t, \Sigma^t) + (\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t} - \mu )(\Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t} - \mu )^T\rp\Sigma^{-1} \end{aligned} \] This leads to the M step estimates:
\[ \mu^{t+1} = \frac{1}{n} \sum_i \Exp{y_i \mid Y_{(0)}, \mu^t, \Sigma^t}, \]
and
\[ \Sigma^{t+1} = \frac{1}{n}\sum_i\lp\text{Cov}(y_i \mid \tilde{y}_{i(0)}, \mu^t, \Sigma^t) + (\Exp{y_i \mid \tilde{y}_{i(0)}, \mu^t, \Sigma^t} - \mu^{t+1} )(\Exp{y_i \mid \tilde{y}_{i(0)}, \mu^t, \Sigma^t} - \mu^{t+1} )^T\rp \]
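Given per-observation conditional means and covariances, the M step is a one-liner; a sketch (names and data layout are mine):

```python
import numpy as np

def m_step(Ey, Cy):
    """M-step updates for (mu, Sigma): Ey[i] is E[y_i | obs data, theta^t],
    Cy[i] is Cov(y_i | obs data, theta^t), as lists of arrays."""
    n = len(Ey)
    mu_new = sum(Ey) / n
    # (1/n) sum_i [ Cov_i + (E[y_i] - mu_new)(E[y_i] - mu_new)^T ]
    Sigma_new = sum(C + np.outer(e - mu_new, e - mu_new)
                    for C, e in zip(Cy, Ey)) / n
    return mu_new, Sigma_new
```

With fully observed data the conditional covariances are zero and the conditional means are the data themselves, so this collapses to the complete-data MLE.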
The key is that the conditional expectation and covariances for \(y_i\) are informed by the observed data.
Suppose our observation is in the first group, so \(m_i = (0, 0, 1)\). Then we need to compute \(\text{Cov}(y_i \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t)\), which is
\[ \begin{bmatrix} \text{Var}(y_{i1} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) & \text{Cov}(y_{i1}, y_{i2} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) & \text{Cov}(y_{i1}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) \\ \text{Cov}(y_{i1}, y_{i2} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) & \text{Var}(y_{i2} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) & \text{Cov}(y_{i2}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) \\ \text{Cov}(y_{i1}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) & \text{Cov}(y_{i2}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) & \text{Var}(y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) \end{bmatrix} \] Most of the elements of this matrix are zero, because of the properties of conditional expectation. Look at \(\text{Cov}(y_{i1}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t)\), for instance: \[ \text{Cov}(y_{i1}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) = \Exp{y_{i1} y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} - \Exp{y_{i1} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t}\Exp{y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} \] \[ \Exp{y_{i1} y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} = y_{i1} \Exp{y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} \] and similarly
\[ \Exp{y_{i1} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} = y_{i1} \]
so we get \[ \begin{aligned} \text{Cov}(y_{i1}, y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t) = & y_{i1} \Exp{y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} - y_{i1} \Exp{y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t} \\ = & 0 \end{aligned} \] The same argument shows that the conditional variances and covariances involving only the observed components are zero. The only nonzero element is \(\text{Var}(y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t)\). We can use the properties of the conditional normal distribution to derive the conditional variance.
Let our covariance matrix \(\Sigma\) be defined: \[ \Sigma = \begin{bmatrix} \Sigma_{12} & \begin{bmatrix} \sigma_{13} \\ \sigma_{23} \end{bmatrix} \\ \begin{bmatrix} \sigma_{13} & \sigma_{23} \end{bmatrix} & \sigma^2_3 \end{bmatrix}\,, \Sigma_{12} = \begin{bmatrix} \sigma_{1}^2 & \sigma_{12} \\ \sigma_{12} & \sigma_{2}^2 \end{bmatrix} \] Then the conditional variance, evaluated at \(\mu^{(t)}, \Sigma^{(t)}\) is: \[ (\sigma^2_3)^{(t)} - \begin{bmatrix} \sigma^{(t)}_{13} & \sigma^{(t)}_{23} \end{bmatrix} (\Sigma_{12}^{(t)})^{-1}\begin{bmatrix} \sigma^{(t)}_{13} \\ \sigma^{(t)}_{23} \end{bmatrix} \] The conditional mean, \(\Exp{y_{i3} \mid y_{i1}, y_{i2}, \mu^t, \Sigma^t}\) is \[ \mu_3^{(t)} + \begin{bmatrix} \sigma^{(t)}_{13} & \sigma^{(t)}_{23} \end{bmatrix} (\Sigma_{12}^{(t)})^{-1}\begin{bmatrix} y_{i1} - \mu_1^{(t)} \\ y_{i2} - \mu_2^{(t)} \end{bmatrix} \] Similar formulas can be used for the second group, where you have observed \(y_{i2},y_{i3}\), but not \(y_{i1}\).
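The conditional-normal formulas above for the pattern \(m_i = (0,0,1)\) translate directly into code; a sketch for one observation (the function name is mine):

```python
import numpy as np

def cond_moments_y3(y12, mu, Sigma):
    """Conditional mean and variance of y3 given (y1, y2) under a
    trivariate normal N(mu, Sigma), evaluated at (mu, Sigma)."""
    S12 = Sigma[:2, :2]           # Sigma_{12}: covariance of (y1, y2)
    s = Sigma[:2, 2]              # (sigma_13, sigma_23)
    w = np.linalg.solve(S12, s)   # Sigma_{12}^{-1} (sigma_13, sigma_23)^T
    # mu_3 + (sigma_13, sigma_23) Sigma_{12}^{-1} (y12 - mu_{12})
    cond_mean = mu[2] + w @ (y12 - mu[:2])
    # sigma_3^2 - (sigma_13, sigma_23) Sigma_{12}^{-1} (sigma_13, sigma_23)^T
    cond_var = Sigma[2, 2] - s @ w
    return cond_mean, cond_var
```

The analogous code for the second group conditions \(y_1\) on \((y_2, y_3)\) by permuting the blocks.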
Measuring uncertainty in EM parameter estimates
We might want to find something like standard errors for our parameter estimates. We can get asymptotic standard errors from the inverse of the observed information matrix, \(I(\theta^\star \mid Y_{(0)})^{-1}\). Start with our standard decomposition of the observed data log-likelihood:
\[ \ell_Y(\theta \mid Y_{(1)} = y_{(1)}, Y_{(0)} = \tilde{y}_{(0)}) = \log(f(Y_{(0)} = \tilde{y}_{(0)} \mid \theta)) + \log(f(Y_{(1)} = y_{(1)}\mid Y_{(0)} = \tilde{y}_{(0)}, \theta)) \]
Rearranging \[ \log(f(Y_{(0)} = \tilde{y}_{(0)} \mid \theta)) = \ell_Y(\theta \mid Y_{(1)} = y_{(1)}, Y_{(0)} = \tilde{y}_{(0)}) - \log(f(Y_{(1)} = y_{(1)}\mid Y_{(0)} = \tilde{y}_{(0)}, \theta)). \tag{1}\]
Make the following notational substitutions, where the function \(\ell(\theta ; V)\) is read as the log-likelihood as a function of \(\theta\) generated from the density corresponding to the distribution for random variable \(V\) indexed by unknown parameter \(\theta\). In what follows \(V\) can be conditional on another random variable, \(X\), as in \(V \mid X\). \[ \begin{aligned} \ell(\theta ; Y_{(0)}) & \equiv \log(f(Y_{(0)} = \tilde{y}_{(0)} \mid \theta)) \\ \ell(\theta ; Y_{(0)}, Y_{(1)}) & \equiv \log\lp f_Y(Y_{(1)} = y_{(1)}, Y_{(0)} = \tilde{y}_{(0)} \mid \theta)\rp \\ \ell(\theta ; Y_{(1)} \mid Y_{(0)}) & \equiv \log(f(Y_{(1)} = y_{(1)}\mid Y_{(0)} = \tilde{y}_{(0)}, \theta)) \end{aligned} \] We now have the following notation for log-likelihood derivatives:
\[ \ell_\theta(\theta^\prime ; V) \equiv \nabla_\theta \log(f(V = \tilde{v} \mid \theta)) \Big|_{\theta = \theta^\prime}, \,\ell_{\theta\theta}(\theta^\prime ; V) \equiv \nabla^2_\theta \log(f(V = \tilde{v} \mid \theta)) \Big|_{\theta = \theta^\prime} \]
We can differentiate Equation 1 twice and evaluate at \(\theta = \theta^\star\) to get: \[ -\ell_{\theta\theta}(\theta^\star; Y_{(0)}) = -\ell_{\theta\theta}(\theta^\star; Y_{(0)},Y_{(1)}) - (-\ell_{\theta\theta}(\theta^\star; Y_{(1)} \mid Y_{(0)})) \] Taking expectations with respect to the random variable \(Y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta^\star\) gives
\[ \begin{aligned} -\ell_{\theta\theta}(\theta^\star; Y_{(0)}) = \ExpA{-\ell_{\theta\theta}(\theta^\star; Y_{(0)},Y_{(1)})}{Y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta^\star} - \ExpA{-\ell_{\theta\theta}(\theta^\star; Y_{(1)} \mid Y_{(0)})}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} \end{aligned} \tag{2}\] Using the Bartlett identity: \[ \ExpA{\ell_{\theta\theta}(\theta;X) + \ell_{\theta}(\theta;X)\ell_{\theta}(\theta;X)^T}{X \sim F_\theta} = 0 \] we can rewrite the second term on the RHS of Equation 2 as: \[ \begin{aligned} \ExpA{-\ell_{\theta\theta}(\theta^\star; Y_{(1)} \mid Y_{(0)})}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} & =\ExpA{\ell_{\theta}(\theta^\star; Y_{(1)} \mid Y_{(0)})\ell_{\theta}(\theta^\star; Y_{(1)} \mid Y_{(0)})^T}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} \\ & = \ExpA{\ell_{\theta}(\theta^\star; Y_{(0)}, Y_{(1)})\ell_{\theta}(\theta^\star; Y_{(0)}, Y_{(1)})^T}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} - \ExpA{\ell_{\theta}(\theta^\star; Y_{(0)})\ell_{\theta}(\theta^\star; Y_{(0)})^T}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} \\ & = \ExpA{\ell_{\theta}(\theta^\star; Y_{(0)},Y_{(1)})\ell_{\theta}(\theta^\star; Y_{(0)}, Y_{(1)})^T}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} \end{aligned} \] where the second line comes from the identity \(\ell_\theta(\theta^\star; Y_{(1)} \mid Y_{(0)}) = \ell_\theta(\theta^\star; Y_{(0)},Y_{(1)}) - \ell_\theta(\theta^\star; Y_{(0)})\), and the third from the fact that we earlier proved a stationary point of \(Q(\theta \mid \theta^t)\) is also a stationary point of \(\ell(\theta; Y_{(0)})\), so \(\ell_\theta(\theta^\star; Y_{(0)}) = 0\) and the second expectation vanishes.
Furthermore, under regularity conditions allowing for gradients to be passed outside of integrals, the first term on the RHS in Equation 2 simplifies to \[ \ExpA{-\ell_{\theta\theta}(\theta^\star; Y_{(0)},Y_{(1)})}{Y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta^\star} = -\nabla^2_\theta Q(\theta \mid \theta^\star)\mid_{\theta = \theta^\star} \] Under the same regularity conditions, the simplification for the second term on the RHS is called Louis’ identity: \[ \begin{aligned} -\nabla^2_\theta H(\theta \mid \theta^\star)\mid_{\theta = \theta^\star} = \ExpA{\ell_{\theta}(\theta^\star; Y_{(0)},Y_{(1)})\ell_{\theta}(\theta^\star; Y_{(0)}, Y_{(1)})^T}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star} \end{aligned} \]
Thus, the final expression for the observed information is \[ \begin{aligned} -\ell_{\theta\theta}(\theta^\star; Y_{(0)}) = -\nabla^2_\theta Q(\theta \mid \theta^\star)\mid_{\theta = \theta^\star} - \ExpA{\ell_{\theta}(\theta^\star; Y_{(0)},Y_{(1)})\ell_{\theta}(\theta^\star; Y_{(0)}, Y_{(1)})^T}{Y_{(1)}\mid Y_{(0)}=\tilde{y}_{(0)}, \theta^\star}. \end{aligned} \]
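In practice it can be useful to sanity-check such information computations against a finite-difference Hessian of the observed-data log-likelihood. A generic sketch, assuming only that `loglik` maps a parameter vector to the scalar observed-data log-likelihood (the function and its name are mine, not from the notes):

```python
import numpy as np

def observed_information(loglik, theta, eps=1e-5):
    """Observed information, i.e. minus the Hessian of the observed-data
    log-likelihood, approximated by central finite differences."""
    p = len(theta)
    H = np.zeros((p, p))
    t0 = np.array(theta, dtype=float)
    for i in range(p):
        for j in range(p):
            def f(di, dj):
                t = t0.copy()
                t[i] += di
                t[j] += dj
                return loglik(t)
            # four-point central difference for the (i, j) second derivative
            H[i, j] = (f(eps, eps) - f(eps, -eps)
                       - f(-eps, eps) + f(-eps, -eps)) / (4 * eps**2)
    return -H
```

This is \(O(p^2)\) likelihood evaluations, so it is a check, not a replacement, for the analytic expression above.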
Variants of EM
The standard EM algorithm has two important characteristics that may make it hard to apply in practice:
The E step, the conditional expectation of the complete data log-likelihood, might be intractable.
The M step might not have a closed-form solution.
There are variants of the EM algorithm that can handle both of these issues.
GEM: Generalized EM
If we can’t maximize the \(Q(\theta \mid \theta^t)\) exactly, we can instead set \(\theta^{t+1}\) so that \[ Q(\theta^t \mid \theta^t) \leq Q(\theta^{t+1} \mid \theta^t). \]
We showed last time that any value of \(\theta^{t+1}\) for which the above holds cannot decrease the observed likelihood, which is ultimately what we’re trying to maximize.
One can show that we still reach a stationary point for the observed likelihood under a GEM algorithm.
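One simple way to get a GEM update when exact maximization of \(Q\) is unavailable is a single gradient-ascent step with backtracking, which guarantees \(Q(\theta^{t+1} \mid \theta^t) \geq Q(\theta^t \mid \theta^t)\). A hedged sketch (all names are mine):

```python
import numpy as np

def gem_step(Q, grad_Q, theta_t, step=1.0, shrink=0.5, max_tries=30):
    """One generalized-EM update: gradient step on Q(. | theta_t),
    shrinking the step size until Q does not decrease."""
    q0 = Q(theta_t, theta_t)
    g = grad_Q(theta_t, theta_t)
    for _ in range(max_tries):
        cand = theta_t + step * g
        if Q(cand, theta_t) >= q0:   # the GEM condition
            return cand
        step *= shrink               # backtrack
    return theta_t  # no improving step found; keep the current iterate
```

Any other update rule works too, as long as the GEM inequality is checked before accepting the candidate.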
ECM
One such variant is ECM, or Expectation-Conditional Maximization. This is a GEM algorithm: if we can’t do the maximization step exactly, we can instead do a sequence of conditional maximizations, which at least increases the \(Q\) function each iteration. We have an example of this sort of model from the first few weeks of class: the repeated measures model:
\[ \begin{aligned} y_{i} \mid X_{i} \, & = X_{i} \beta + \epsilon_{i} \\ \epsilon_{i} & \sim \text{Normal}(0, \Sigma) \\ \epsilon_{i} & \indy \epsilon_{j} \forall i\neq j \end{aligned} \]
We can look back in our notes to see that we have an iterative maximization scheme for this model:
\[ \beta^{(t+1)} = \left(\sum_i X_i^T (\Sigma^{(t)})^{-1}X_i\right)^{-1} \sum_i X_i^T (\Sigma^{(t)})^{-1}y_i \] and \[ \Sigma^{(t+1)} = \frac{1}{n}\sum_i (y_i - X_i \beta^{(t+1)}) (y_i - X_i \beta^{(t+1)})^T \]
Each step of the conditional maximization increases the complete data likelihood: \[ \begin{aligned} \ell_Y(\beta^{t+1}, \Sigma^{t} \mid Y) & \geq \ell_Y(\beta^{t}, \Sigma^{t} \mid Y) \\ \ell_Y(\beta^{t+1}, \Sigma^{t+1} \mid Y) & \geq \ell_Y(\beta^{t+1}, \Sigma^{t} \mid Y) \end{aligned} \] Thus, if we had missing outcome data in our regression, we could run an ECM algorithm with the following steps:
Initialize with \(\beta^1, \Sigma^1\)
For \(t = 1, 2, \dots\) compute
Find the conditional expectations: \(\Exp{y_i \mid Y_{(0)}, X, \beta^t, \Sigma^t}\), \(\Exp{y_i y_i^T \mid Y_{(0)}, X, \beta^t, \Sigma^t}\)
Set \(\beta^{t+1}\) to: \[ \beta^{(t+1)} = \left(\sum_i X_i^T (\Sigma^{(t)})^{-1}X_i\right)^{-1} \sum_i X_i^T (\Sigma^{(t)})^{-1}\Exp{y_i \mid Y_{(0)}, X, \beta^t, \Sigma^t} \]
Set \(\Sigma^{t+1}\) to: \[ \Sigma^{(t+1)} = \frac{1}{n}\sum_i \Exp{(y_i - X_i \beta^{(t+1)}) (y_i - X_i \beta^{(t+1)})^T \mid Y_{(0)}, X, \beta^t, \Sigma^t} \]
If \(Q(\theta^{t+1} \mid \theta^t) - Q(\theta^{t} \mid \theta^t) \geq \epsilon\), continue, otherwise end
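The steps above might look like the following sketch. It assumes each \(y_i\) is \(d\)-dimensional with design matrix \(X_i\), a boolean mask `M` marking missing entries, and the usual conditional-normal formulas for the E step; all helper names and the data layout are my own assumptions, not from the notes:

```python
import numpy as np

def e_step_moments(y, obs, mean, Sigma):
    """E[y | y_obs] and Cov(y | y_obs) for one observation under
    N(mean, Sigma); `obs` is a boolean mask of observed entries."""
    mis = ~obs
    Ey, Cy = y.astype(float).copy(), np.zeros_like(Sigma)
    if mis.any():
        # regression coefficients of y_mis on y_obs
        B = np.linalg.solve(Sigma[np.ix_(obs, obs)], Sigma[np.ix_(obs, mis)])
        Ey[mis] = mean[mis] + B.T @ (y[obs] - mean[obs])
        Cy[np.ix_(mis, mis)] = Sigma[np.ix_(mis, mis)] - Sigma[np.ix_(mis, obs)] @ B
    return Ey, Cy

def ecm_iteration(Y, M, X, beta, Sigma):
    """One ECM pass: E-step conditional moments, then the two conditional
    maximizations for beta and Sigma. M is True where y is missing."""
    n = len(Y)
    Ey, Cy = zip(*(e_step_moments(Y[i], ~M[i], X[i] @ beta, Sigma)
                   for i in range(n)))
    Sinv = np.linalg.inv(Sigma)
    A = sum(X[i].T @ Sinv @ X[i] for i in range(n))
    b = sum(X[i].T @ Sinv @ Ey[i] for i in range(n))
    beta_new = np.linalg.solve(A, b)            # CM step 1: beta
    Sigma_new = sum(Cy[i] + np.outer(Ey[i] - X[i] @ beta_new,
                                     Ey[i] - X[i] @ beta_new)
                    for i in range(n)) / n      # CM step 2: Sigma
    return beta_new, Sigma_new
```

With no missing data and \(\Sigma = I\), one iteration reproduces the pooled least-squares update, which is a useful correctness check.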
Monte Carlo EM
Monte Carlo EM (MCEM) algorithms can be used when the conditional expectation is not possible to do analytically. That is, if we cannot do this integral: \[ Q(\theta \mid \theta^{(t)}) = \int_{\mathcal{Y}_{(1)}}\ell_Y(\theta \mid Y_{(0)} = \tilde{y}_{(0)}, Y_{(1)} = y_{(1)} )f(Y_{(1)} = y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta^{(t)})dy_{(1)} \] we can instead approximate it with a Monte Carlo estimator. Assuming we can draw samples from \(y_{(1)}^s \sim f(Y_{(1)} \mid Y_{(0)} = \tilde{y}_{(0)}, \theta^{(t)})\), we can compute:
\[ \hat{Q}(\theta \mid \theta^{(t)}) = \frac{1}{S}\sum_{s=1}^S \ell_Y(\theta \mid Y_{(0)} = \tilde{y}_{(0)}, Y_{(1)} = y^s_{(1)} ) \]
which, by the law of large numbers, satisfies: \[ \lim_{S\to\infty}\hat{Q}(\theta \mid \theta^{(t)}) = Q(\theta \mid \theta^{(t)}) \] This change in the algorithm isn’t without its complications, because we now have to determine how to measure convergence. While it is true that \(Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)})\), we now have to contend with the fact that there is noise in our assessment of convergence, so instead of checking a deterministic inequality we need to ask whether something like \[ P(\hat{Q}(\theta^{(t+1)} \mid \theta^{(t)}) - \hat{Q}(\theta^{(t)} \mid \theta^{(t)}) \geq \epsilon) \] is large.
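The Monte Carlo approximation of \(Q\) is straightforward once we can sample the missing data; a generic sketch, where `complete_loglik` and `sample_missing` are hypothetical user-supplied functions (their names and signatures are my own):

```python
import numpy as np

def q_hat(theta, theta_t, complete_loglik, sample_missing, S=1000, rng=None):
    """Monte Carlo estimate of Q(theta | theta_t): average the complete-data
    log-likelihood over S draws y_mis ~ f(Y_mis | Y_obs, theta_t)."""
    if rng is None:
        rng = np.random.default_rng()
    draws = (sample_missing(theta_t, rng) for _ in range(S))
    return sum(complete_loglik(theta, y_mis) for y_mis in draws) / S
```

Increasing `S` as the iterations proceed is a common strategy: early iterations tolerate a noisy \(\hat{Q}\), while convergence assessment needs a more precise estimate.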