Missing data lecture 5

Maximum likelihood for multivariate normal distribution

Let \(y_i \in \R^K\), \(y_i \overset{\text{iid}}{\sim} \text{Normal}(\mu, \Sigma)\) for \(n\) samples so that the density for \(y_i\) is \[ f_{Y}(y_i \mid \mu, \Sigma) = (2\pi)^{-\frac{K}{2}} (\det\Sigma)^{-\frac{1}{2}} \exp\lp-\frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu) \rp \]

The log-likelihood, dropping additive constants, is: \[ \ell_{Y}(\mu, \Sigma \mid y_i) = -\frac{1}{2} \log \det\Sigma -\frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu) \]
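As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and SciPy are available; the parameter values and data are made up for illustration) that implements the full log density above and compares it against `scipy.stats.multivariate_normal.logpdf`. The log-likelihood in the text simply drops the constant \(-\frac{K}{2}\log 2\pi\).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

K = 3
mu = rng.normal(size=K)
A = rng.normal(size=(K, K))
Sigma = A @ A.T + K * np.eye(K)          # a positive definite covariance matrix
y = rng.multivariate_normal(mu, Sigma)   # one draw from Normal(mu, Sigma)

def mvn_logpdf(y, mu, Sigma):
    """Full log density of Normal(mu, Sigma) evaluated at y."""
    K = len(mu)
    resid = y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = resid @ np.linalg.solve(Sigma, resid)
    return -0.5 * (K * np.log(2 * np.pi) + logdet + quad)

# The two values should agree up to floating point error
print(mvn_logpdf(y, mu, Sigma))
print(multivariate_normal.logpdf(y, mean=mu, cov=Sigma))
```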

The book gives the expressions for the MLEs of the mean and covariance matrix of the multivariate normal distribution without details. Going through the algebra can be useful for other more complicated problems. But in order to do so, we’ll need a slight change to how we’re used to thinking about partial differentiation. The following blurb on differentials is based on Magnus and Neudecker (2019).

Differentials and matrix differentiation

It all starts with rearranging the derivative:

\[ f^\prime(c) = \lim_{u \to 0} \frac{f(c + u) - f(c)}{u}, \] to get a linear approximation to \(f\) at the point \(c\):

\[ f(c + u) = f(c) + f^\prime(c) u + r_c(u) \] where \(r_c(u) = o(u)\) or \(\lim_{u\to0} \frac{r_c(u)}{u} = 0\). This is the one-term Taylor expansion of \(f\) at \(c + u\) about \(c\).

Bringing \(f(c)\) to the left-hand side gives: \(f(c + u) - f(c) = f^\prime(c) u + r_c(u)\). We can define the change in the linear approximation of \(f\) from \(c\) to \(c+u\) as \(\mathrm{d}f(c;\,u)\), or the first differential of \(f\) at \(c\) with increment \(u\): \[ \mathrm{d} f(c ; u) = f^\prime(c)\, u \]

Subbing this back into the linear approximation for \(f\) gives: \[ f(c + u) = f(c) + \mathrm{d} f(c ; u) + r_c(u) \tag{1}\]

We can identify the differential by finding the linear approximation to a function at \(c\): \[ f(c + u) = f(c) + \alpha u + r_c(u). \] If we can find an \(\alpha\) that depends on \(c\) but not on \(u\) such that \(r_c(u) = o(u)\) we say that \(\alpha = f^\prime(c)\).

Let \(f\) now be a function \(\R^m \to \R\) and let the differential be constructed via the same argument as above, but now let \(c, u \in \R^m\), and define \(r_c(u)\) such that \(\lim_{u\to 0}\frac{r_c(u)}{\lVert u \rVert} = 0\): \[ f(c + u) = f(c) + A(c) u + r_c(u). \] If we identify the row vector \(A(c)\) with the vector of partial derivatives of \(f\) evaluated at \(c\), we can recognize this as the first-order multivariate Taylor expansion of \(f(c + u)\) about \(c\).

In fact, this is exactly what \(A(c)\) is: \(A(c) \equiv \nabla_x f(x) \mid_{x = c}\).

Example 1 (Product example) Let’s determine the differential of \(f(x, y) = x^T y\) for \(x,y \in \R^m\), denoted as: \[ \mathrm{d}(x^T y). \] We’re looking to expand the left-hand side of Equation 1 so that it takes the form \(c_x^T c_y + \diff{x^Ty} + r_c(u)\).

Let \(c_x\) and \(c_y\) be the values of \(x\) and \(y\) about which we’ll define our linear approximation. Specifically, we’ll define the linear approximation at the coordinates \((c_x + u_x, c_y + u_y)\). In other words, we’d like to approximate the function: \((c_x + u_x)^T (c_y + u_y)\) with the value at \(c_x^T c_y\) plus the differential and a small remainder term. We’ll accomplish this by expanding the product \((c_x + u_x)^T (c_y + u_y)\) into four parts: \[ \begin{align} (u_x + c_x)^T (u_y + c_y) & = c_x^T c_y + c_x^T u_y + u_x^T c_y + u_x^T u_y \\ & = c_x^T c_y + c_x^T u_y + c_y^T u_x + u_x^T u_y \\ & = c_x^T c_y + \begin{bmatrix}c_x^T & c_y^T \end{bmatrix} \begin{bmatrix}u_y \\ u_x \end{bmatrix} + u_x^T u_y \tag{a}\label{eq:1} \end{align} \] In line \(\eqref{eq:1}\) we can see that \(f(c) \equiv c_x^T c_y\), and \(\lim_{u_x, u_y \to 0} \frac{u_x^T u_y}{\sqrt{\lVert u_x \rVert^2 + \lVert u_y \rVert^2}} = 0\), so \(u_x^T u_y\) is our \(r_c(u)\). That means that the vector \(A(c)\) here is \((c_x^T, c_y^T)\) as we’ve ordered our variables as \((u_y,u_x)\).
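A small numerical sketch (assuming NumPy; the values of \(c_x\), \(c_y\), and the increments are arbitrary) illustrates the defining property: after subtracting the differential \(c_x^T u_y + c_y^T u_x\) from the increment of \(f\), the remainder \(u_x^T u_y\) vanishes faster than \(\lVert u \rVert\).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4
c_x, c_y = rng.normal(size=m), rng.normal(size=m)

def f(x, y):
    return x @ y    # f(x, y) = x^T y

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    u_x, u_y = eps * rng.normal(size=m), eps * rng.normal(size=m)
    increment = f(c_x + u_x, c_y + u_y) - f(c_x, c_y)
    differential = c_x @ u_y + c_y @ u_x        # d(x^T y) at (c_x, c_y)
    remainder = increment - differential        # should equal u_x^T u_y
    norm_u = np.sqrt(u_x @ u_x + u_y @ u_y)
    print(f"||u|| = {norm_u:.1e}, remainder / ||u|| = {remainder / norm_u:.2e}")
```

The ratio `remainder / ||u||` should shrink roughly linearly with \(\lVert u \rVert\), which is exactly the \(o(u)\) condition.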

Rules for the differential operator

  1. The differential operator is denoted \(\diff{\cdot}\). It is linear: \[ \diff{a x + b y} = a\diff{x} + b\diff{y} \]

  2. The differential of a constant is zero: \[ \diff{a} = 0 \]

  3. The differential of a transposed variable is the transpose of the differential \[ \diff{x^T} = \diff{x}^T \]

  4. The differential of a product is the sum of the differential applied to each variable (as shown in Example 1) \[ \diff{XY} = \diff{X}Y + X\diff{Y} \]

Example 1 demonstrates another useful property of differentials. We recognize the fact that our variables partition naturally into two vectors, \(x\) and \(y\). When we have a natural partition of variables \(u\) into \(u_1\) and \(u_2\) we can write the differential for \(f(u)\) more easily in terms of two differentials: \[ \begin{aligned} \mathrm{d}f(c;\,u) & = A(c) u \\ & = A(c_1) u_1 + A(c_2)u_2 \end{aligned} \] which just differentiates between the two sets of variables, so that \(A(c_1)\) is the partial derivative of \(f\) with respect to \(u_1\) and \(A(c_2)\) is the partial derivative with respect to \(u_2\). In Example 1, \(u_1\) is \(u_y\) and \(u_2\) is \(u_x\), so we can write the differential above in the equivalent form:

\[ \begin{aligned} \mathrm{d}(x^T y) & = x^T u_y + u_x^T y \\ & = x^T u_y + y^T u_x \end{aligned} \] In order to simplify the notation, we’ll write \(\diff{x}\) instead of \(u_x\), so the above would be: \[ \begin{aligned} \mathrm{d}(x^T y) & = x^T \diff{y} + y^T \diff{x} \end{aligned} \]

This expression shows that we can read off \(\nabla_y x^T y\) as \(x^T\) and \(\nabla_x x^T y\) as \(y^T\). In fact, if we cared only about \(\diff{y}\) then we could ignore the differential \(\diff{x}\), essentially treating \(x\) as a constant so \(\diff{x} = 0\).

This is important for thinking about differentials of log-likelihoods like the multivariate normal where we’ll have two sets of parameters that we’d like to find the partial derivatives with respect to, \(\Sigma\) and \(\mu\): \[ \ell_Y(\mu, \Sigma \mid y) = -\frac{1}{2} \log \det\Sigma -\frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu) \] \[ \mathrm{d}\ell_Y(\mu, \Sigma \mid y) = -\frac{1}{2} \mathrm{d}(\log \det\Sigma) -\frac{1}{2}\mathrm{d}((y_i - \mu)^T \Sigma^{-1}(y_i - \mu)) \tag{2}\]

To the extent we’d prefer to focus only on \(\diff{\mu}\) for example, we would ignore the first term on the RHS of Equation 2, and focus only on the second term.

Generalizing to vector valued functions

We can generalize to vector functions: Let \(f(x): \R^m \to \R^n\): \[ f(c + u) = f(c) + A(c) u + r_c(u). \] for \(\lim_{u\to 0} r_c(u) / \norm{u} = 0\). Then \(\mathrm{d}f(c;u) = A(c)u\) is the differential of \(f\) evaluated at \(c\) of increment \(u\).

In fact, this is the multivariate Taylor expansion. Each row of \(A(c)\) is the term \(\nabla_x f_i(x) \mid_{x = c}\) where \(f_i\) is the \(i^\mathrm{th}\) entry of the length-\(n\) vector \(f(x)\).
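The claim that each row of \(A(c)\) is the gradient of a component of \(f\) can be checked numerically. A minimal sketch, assuming NumPy and a made-up function \(f: \R^2 \to \R^2\), compares the analytic Jacobian with a central-difference approximation.

```python
import numpy as np

def f(x):
    """A toy vector-valued function R^2 -> R^2."""
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def jacobian_analytic(x):
    # Row i holds the gradient of f_i evaluated at x
    return np.array([[x[1], x[0]],
                     [np.cos(x[0]), 2 * x[1]]])

def jacobian_fd(fun, c, h=1e-6):
    """Central-difference Jacobian: rows index outputs, columns index inputs."""
    fc = fun(c)
    J = np.zeros((len(fc), len(c)))
    for j in range(len(c)):
        e = np.zeros(len(c))
        e[j] = h
        J[:, j] = (fun(c + e) - fun(c - e)) / (2 * h)
    return J

c = np.array([0.7, -1.3])
print(jacobian_analytic(c))
print(jacobian_fd(f, c))   # should match to finite-difference accuracy
```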

Generalizing to matrix valued functions

The same idea applies to matrices, when combined with the \(\text{vec}\) function, which concatenates an \(n \times p\) matrix column by column into an \(np\)-length vector. Let \(F\) be a matrix function \(\R^{n \times q} \to \R^{m \times p}\). Let \(C\) and \(U\) be in \(\R^{n\times q}\). If \(A(C) \in \R^{mp \times nq}\) such that: \[ \text{vec}(F(C + U)) = \text{vec}(F(C)) + A(C) \text{vec}(U) + \text{vec}(R_c(U)). \] Then the \(m\times p\) matrix \(\mathrm{d} F(C;\,U)\) is defined by \(\text{vec}(\mathrm{d} F(C;\,U)) = A(C) \text{vec}(U)\).

The reason to do this is because the differential generalizes to matrices a bit easier than do partial derivatives. This is because it isn’t clear along which dimensions the partial derivatives should lie: Should the partials of a matrix function become a third dimension, like a 3-d array?

Under this framework, the rows of the matrices \(A(c)\) and \(A(C)\) correspond to a dimension of the range of the function \(f(c)\) or \(F(C)\), while the columns correspond to a dimension of the domain.

The chain rule

The power of the differentials is laid bare when working through the chain rule, which is called Cauchy’s rule of invariance in differential-land. Let \(f: \R^m \to \R^p\) and \(g: \R^p \to \R^n\), and let \(h = g \circ f\). Then \(h: \R^m \to \R^n\). If \(b = f(c)\), the differential of \(h\) is: \[ \begin{aligned} \mathrm{d}h(c;\,u) & = \mathrm{d}g(b;\,\mathrm{d}f(c;\,u)) \\ & = A_{g}(b) A_{f}(c) u \end{aligned} \] where \(A_g(b) \in \R^{n \times p}\), \(A_f(c) \in \R^{p \times m}\), and \(u \in \R^m\).

We can show this rigorously with our previous definitions. \[ \begin{align} h(c + u) & = g(f(c + u)) \\ & = g(f(c) + A_f(c)u + r_c(u)) \\ & = g(b + v) \tag{$v = A_f(c)u + r_c(u)$}\\ & = g(b) + A_g(b) v + r_b(v) \\ & = g(b) + A_g(b) \lp A_f(c) u + r_c(u) \rp + r_b(v) \\ & = h(c) + A_g(b) A_f(c) u + A_g(b) r_c(u) + r_b(v) \\ & = h(c) + A_g(b) A_f(c) u + \tilde{r}_c(u) \end{align} \] where \(\tilde{r}_c(u) = A_g(b) r_c(u) + r_b(v) = o(u)\), so \(A_g(b) A_f(c)\) is indeed the derivative of \(h\) at \(c\).
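Here is a small numerical check of Cauchy's rule of invariance (assuming NumPy; the functions `f` and `g` are arbitrary examples): the finite-difference Jacobian of the composition should match the product of the two Jacobians.

```python
import numpy as np

def f(x):                      # f: R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + np.exp(x[1])])

def g(y):                      # g: R^2 -> R^1
    return np.array([np.sin(y[0]) + y[1] ** 2])

def jac_fd(fun, c, h=1e-6):
    """Central-difference Jacobian: rows index outputs, columns index inputs."""
    fc = fun(c)
    J = np.zeros((len(fc), len(c)))
    for j in range(len(c)):
        e = np.zeros(len(c))
        e[j] = h
        J[:, j] = (fun(c + e) - fun(c - e)) / (2 * h)
    return J

c = np.array([0.4, -0.9])
b = f(c)
lhs = jac_fd(lambda x: g(f(x)), c)     # A_h(c), Jacobian of the composition
rhs = jac_fd(g, b) @ jac_fd(f, c)      # A_g(b) A_f(c)
print(lhs)
print(rhs)   # should match to finite-difference accuracy
```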

Differential with respect to \(\mu\)

First we’ll ignore the differential with respect to \(\Sigma\). We’ll expand out that quadratic form into the parts that depend only on \(\mu\): \[ \mathrm{d} \ell_{Y}(\mu, \Sigma \mid y_i) = y_i^T\Sigma^{-1}\mathrm{d}\mu - \frac{1}{2}\mathrm{d}(\mu^T\Sigma^{-1}\mu) \]

Applying the differential rules (treating \(\Sigma\) as a constant) gives: \[ \begin{aligned} \mathrm{d} \ell_{Y}(\mu, \Sigma \mid y_i) & = y_i^T\Sigma^{-1}\mathrm{d} \mu - \frac{1}{2}\mathrm{d}(\mu^T)\Sigma^{-1}\mu - \frac{1}{2}\mu^T\mathrm{d}(\Sigma^{-1}\mu) \\ & = y_i^T\Sigma^{-1}\mathrm{d} \mu - \frac{1}{2}\mathrm{d}(\mu)^T\Sigma^{-1}\mu - \frac{1}{2}\mu^T\Sigma^{-1}\mathrm{d}\mu \\ & = y_i^T\Sigma^{-1}\mathrm{d} \mu - \frac{1}{2}\mu^T\Sigma^{-1}\mathrm{d}\mu - \frac{1}{2}\mu^T\Sigma^{-1}\mathrm{d}\mu \\ & = y_i^T\Sigma^{-1}\mathrm{d} \mu - \mu^T\Sigma^{-1}\mathrm{d}\mu \\ & = (y_i - \mu)^T\Sigma^{-1}\mathrm{d} \mu \\ \end{aligned} \] If we sum over the \(n\) terms of the log-likelihood we get: \[ \begin{aligned} \frac{\partial \ell_{Y}(\mu, \Sigma \mid Y)}{\partial \mu} & = (\sum_i y_i - n \mu)^T\Sigma^{-1} \end{aligned} \] Setting this to zero leads to the MLE for \(\mu\): \[ \hat{\mu} = \frac{1}{n}\sum_i y_i \]
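A minimal sketch (assuming NumPy and simulated data; all names are illustrative) checks the derived gradient \((\sum_i y_i - n\mu)^T\Sigma^{-1}\) against a finite-difference gradient of the summed log-likelihood, and confirms it vanishes at \(\hat{\mu} = \bar{y}\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 200, 3
mu_true = rng.normal(size=K)
A = rng.normal(size=(K, K))
Sigma = A @ A.T + K * np.eye(K)
Y = rng.multivariate_normal(mu_true, Sigma, size=n)   # rows are y_i

def loglik(mu, Sigma, Y):
    """Summed log-likelihood over the n observations, constants dropped."""
    n = Y.shape[0]
    resid = Y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.einsum('ij,jk,ik->', resid, np.linalg.inv(Sigma), resid)
    return -0.5 * n * logdet - 0.5 * quad

mu0 = rng.normal(size=K)
grad_analytic = (Y.sum(axis=0) - n * mu0) @ np.linalg.inv(Sigma)

h = 1e-6
grad_fd = np.array([
    (loglik(mu0 + h * e, Sigma, Y) - loglik(mu0 - h * e, Sigma, Y)) / (2 * h)
    for e in np.eye(K)
])
print(grad_analytic)
print(grad_fd)                              # matches the derived expression

mu_hat = Y.mean(axis=0)
print((Y.sum(axis=0) - n * mu_hat) @ np.linalg.inv(Sigma))   # exactly zero at the MLE
```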

It’ll be useful to write the log-likelihood a bit differently to find the MLE for \(\Sigma\). Remember that \(\det A^{-1} = (\det A)^{-1}\). This will enable us to write everything in terms of \(\Sigma^{-1}\) instead of \(\Sigma\): \[ \ell_{Y}(\mu, \Sigma \mid y_i) = \frac{1}{2} \log (\det\Sigma^{-1}) - \frac{1}{2}(y_i - \mu)^T \Sigma^{-1}(y_i - \mu) \] Also remember that \(\text{tr}(A) = \sum_{i} A_{ii}\), \(\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)\), and that a scalar equals its own trace, \(a = \text{tr}(a)\) for \(a \in \R\). Finally, recall the cyclic property \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\). This will let us rewrite the quadratic form as the trace of a product of matrices.

Putting all this together allows us to write the log-likelihood for the multivariate normal as such: \[ \ell_{Y}(\mu, \Sigma \mid y_i) = \frac{1}{2} \log (\det\Sigma^{-1}) -\frac{1}{2}\text{tr}\lp(y_i - \mu) (y_i - \mu)^T \Sigma^{-1}\rp \]

For the partial derivative of \(\det \Sigma^{-1}\) with respect to \(\Sigma^{-1}\), we get \[ \frac{\partial \det \Sigma^{-1}}{\partial \Sigma^{-1}} = \det \Sigma^{-1} ((\Sigma^{-1})^{-1})^{T} = \det\Sigma^{-1}\, \Sigma \] using the symmetry of \(\Sigma\), so the partial derivative of \(\log\det\Sigma^{-1}\) is \(\Sigma\). For the partial derivative of \(\text{tr}(AB)\) with respect to \(B\) we get \(A^T\), so the partial derivative with respect to \(\Sigma^{-1}\) of the log-likelihood gives us: \[ \frac{\partial \ell_{Y}(\mu, \Sigma \mid y_i)}{\partial \Sigma^{-1}} = \frac{1}{2} \Sigma -\frac{1}{2}(y_i - \mu) (y_i - \mu)^T \] Summing over the \(n\) terms gives: \[ \frac{\partial \ell_{Y}(\mu, \Sigma \mid Y)}{\partial \Sigma^{-1}} = \frac{n}{2} \Sigma -\frac{1}{2}\sum_i (y_i - \mu) (y_i - \mu)^T \] Setting this to zero and plugging in \(\hat{\mu}\) yields: \[ \hat{\Sigma} = \frac{1}{n}\textstyle\sum_i (y_i - \hat{\mu}) (y_i - \hat{\mu})^T \]
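The closed-form MLEs are easy to compute directly. A minimal sketch, assuming NumPy and simulated data; note the \(1/n\) divisor, which differs from the \(1/(n-1)\) that `np.cov` uses by default.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 500, 3
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(K, K))
Sigma_true = A @ A.T + K * np.eye(K)
Y = rng.multivariate_normal(mu_true, Sigma_true, size=n)

# Closed-form MLEs derived above
mu_hat = Y.mean(axis=0)                       # (1/n) sum_i y_i
resid = Y - mu_hat
Sigma_hat = resid.T @ resid / n               # (1/n) sum_i (y_i - mu_hat)(y_i - mu_hat)^T

print(mu_hat)
print(Sigma_hat)
print(np.cov(Y, rowvar=False) * (n - 1) / n)  # rescaled sample covariance equals Sigma_hat
```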

Normal repeated measures models

In many longitudinal studies where some outcome of interest is measured for participants \(K\) times, the following model may describe the data generating process well, where \(y_i \in \R^K\) and \(X_i\) is a \(K \times m\) design matrix: \[ \begin{aligned} y_i \mid X_i \sim \text{MultiNormal}(X_i \beta, \Sigma(\psi)) \end{aligned} \] The textbook lists several models that could describe different scenarios.

  1. Independent-but-not-identically-distributed observations within groups: \[ \begin{aligned} y_{ik} \mid (X_{i})_k \, & = (X_{i})_k \beta + \epsilon_{ik} \\ \epsilon_{ik} & \sim \text{Normal}(0, \sigma^2_k) \\ \epsilon_{ik} & \indy \epsilon_{jl} \;\; \forall\, i\neq j \text{ or } k \neq l \end{aligned} \] This implies the following simple structure for \(\Sigma(\psi)\) above (a code sketch constructing this and the other covariance structures follows the list):

\(\Sigma(\psi) = \text{diag}(\sigma^2_1, \dots, \sigma^2_K)\).

  2. Compound symmetry (I’ll call this a random intercept model): \[ \begin{aligned} y_{ik} \mid (X_{i})_k \, & = (X_{i})_k \beta + \gamma_i + \epsilon_{ik} \\ \epsilon_{ik} & \sim \text{Normal}(0, \sigma^2) \\ \epsilon_{ik} & \indy \epsilon_{jl} \;\; \forall\, i\neq j \text{ or } k \neq l \\ \gamma_i & \sim \text{Normal}(0, \tau^2) \\ \gamma_i & \indy \gamma_j \;\forall\, i \neq j \\ \gamma_i & \indy \epsilon_{ij} \;\forall\, j \end{aligned} \] Conditional on \(X_i\), the covariance between \(y_{ik}\) and \(y_{ij}\) is: \[ \begin{aligned} \text{Cov}(y_{ik}, y_{ij} \mid X_i) & = \text{Cov}((X_{i})_k \beta + \gamma_i + \epsilon_{ik}, (X_{i})_j \beta + \gamma_i + \epsilon_{ij} \mid X_i) \\ & = \tau^2 \end{aligned} \] for \(k \neq j\), while the variance of each \(y_{ik}\) is \(\tau^2 + \sigma^2\). This implies that we can write the variance-covariance matrix as \[ \Sigma(\psi) = \tau^2 1_{K} 1_K^T + \sigma^2 I_K. \]

  3. Autoregressive errors (an AR(1) model for the within-unit errors):

\[ \begin{aligned} y_{ik} \mid (X_{i})_k \, & = (X_{i})_k \beta + \epsilon_{ik} \\ \epsilon_{ik} \mid \epsilon_{i,k-1} & \sim \text{Normal}(\rho \epsilon_{i,k-1}, \sigma^2) \\ \epsilon_{i1} & \sim \text{Normal}(0, \frac{\sigma^2}{1 - \rho^2}) \\ \epsilon_{ik} & \indy \epsilon_{jl} \;\; \forall\, i\neq j \text{ and all } k, l \end{aligned} \]

This implies the following simple structure for \(\Sigma(\psi)\):

\[\Sigma(\psi)_{jk} = \frac{\sigma^2}{1-\rho^2} \rho^{\abs{j-k}}\]

  4. Random effects model

This is a more general version of the random intercept model. Let \(z_k \in \R^q\).

\[ \begin{aligned} y_{ik} \mid (X_{i})_k \, & = (X_{i})_k \beta + z_k^T \gamma_i + \epsilon_{ik} \\ \epsilon_{ik} & \sim \text{Normal}(0, \sigma^2) \\ \epsilon_{ik} & \indy \epsilon_{jl} \;\; \forall\, i\neq j \text{ or } k \neq l \\ \gamma_i & \sim \text{Normal}(0, \Omega) \\ \gamma_i & \indy \gamma_j \;\forall\, i \neq j \\ \gamma_i & \indy \epsilon_{ij} \;\forall\, j \end{aligned} \] We can write this in matrix form as:

\[ y_i \mid X_i = X_i \beta + Z \gamma_i + \epsilon_i \] where \(Z\) is the \(K \times q\) matrix whose \(k^{\text{th}}\) row is \(z_k^T\). The conditional covariance is \[ \begin{aligned} \text{Cov}(y_i \mid X_i) & = \text{Cov}(X_i \beta + Z \gamma_i + \epsilon_i \mid X_i) \\ & = Z \Omega Z^T + \sigma^2 I_K \end{aligned} \]

  5. Hierarchical Gaussian process model

Suppose we have time points \(t_{i1}, \dots, t_{iK}\) associated with each measurement \(y_{i1}, \dots, y_{iK}\).

Let’s define the function \(\Omega\), a map from \(\R^K\) to the space of \(K \times K\) positive definite matrices. Let \(t_i \in \R^K\) and let \(\Omega(t_i \mid \ell, \sigma^2)\) be defined by \[ \Omega(t_i \mid \ell, \sigma^2)_{jk} = \sigma^2 \exp\lp-\lp t_{ij} - t_{ik} \rp^2/(2\ell^2)\rp \] Then the following model is a hierarchical Gaussian process model

\[ \begin{aligned} y_{i} \mid X_{i} \, & \sim \text{MultiNormal} \lp X_{i} \beta, \Omega(t_i \mid \ell, \sigma^2)\rp \end{aligned} \]
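As referenced after model 1, here is a minimal sketch (assuming NumPy; all parameter values are arbitrary) constructing the covariance structures for models 1 through 5.

```python
import numpy as np

def diag_cov(sigma2):
    """Model 1: independent, heteroscedastic errors, diag(sigma^2_1, ..., sigma^2_K)."""
    return np.diag(sigma2)

def compound_symmetry(K, tau2, sigma2):
    """Model 2: tau^2 * 1 1^T + sigma^2 * I."""
    return tau2 * np.ones((K, K)) + sigma2 * np.eye(K)

def ar1_cov(K, rho, sigma2):
    """Model 3: (sigma^2 / (1 - rho^2)) * rho^{|j - k|}."""
    idx = np.arange(K)
    return sigma2 / (1 - rho ** 2) * rho ** np.abs(idx[:, None] - idx[None, :])

def random_effects_cov(Z, Omega, sigma2):
    """Model 4: Z Omega Z^T + sigma^2 * I."""
    return Z @ Omega @ Z.T + sigma2 * np.eye(Z.shape[0])

def sq_exp_cov(t, length_scale, sigma2):
    """Model 5: squared-exponential kernel over measurement times t."""
    diff = t[:, None] - t[None, :]
    return sigma2 * np.exp(-diff ** 2 / (2 * length_scale ** 2))

K = 4
print(diag_cov(np.array([1.0, 2.0, 3.0, 4.0])))
print(compound_symmetry(K, tau2=2.0, sigma2=1.0))
print(ar1_cov(K, rho=0.6, sigma2=1.0))
# A random intercept (z_k = 1 for all k) reduces model 4 to compound symmetry
print(random_effects_cov(np.ones((K, 1)), Omega=np.array([[2.0]]), sigma2=1.0))
print(sq_exp_cov(np.array([0.0, 0.5, 1.0, 2.0]), length_scale=1.0, sigma2=1.5))
```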

MLEs in repeated measures models

The book suggests the following strategy to find the MLEs in the unstructured case, where \(\Sigma\) is a general positive definite matrix with no structure \(\psi\) imposed:

Take \(\beta^{(0)}\) and \(\Sigma^{(0)}\) as initial guesses. Then for \(t = 0, 1, 2, \dots\), iterate the following updates until some termination criterion is met:

\[ \beta^{(t+1)} = \left(\sum_i X_i^T (\Sigma^{(t)})^{-1}X_i\right)^{-1} \sum_i X_i^T (\Sigma^{(t)})^{-1}y_i \] and \[ \Sigma^{(t+1)} = \frac{1}{n}\sum_i (y_i - X_i \beta^{(t+1)}) (y_i - X_i \beta^{(t+1)})^T \] We can derive these update rules from the log-likelihood, but we’ll need to rewrite the model so that it looks a little more familiar.
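A minimal sketch of this alternating scheme on simulated data, assuming NumPy; the convergence criterion (change in \(\beta\)) and the starting values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, m = 300, 4, 2

# Simulate y_i | X_i ~ Normal(X_i beta, Sigma) with an unstructured Sigma
beta_true = np.array([1.0, -0.5])
A = rng.normal(size=(K, K))
Sigma_true = A @ A.T + K * np.eye(K)
X = rng.normal(size=(n, K, m))                 # each X_i is K x m
Y = np.stack([rng.multivariate_normal(X[i] @ beta_true, Sigma_true) for i in range(n)])

# Alternate the two update rules until beta stops changing
beta, Sigma = np.zeros(m), np.eye(K)
for _ in range(200):
    Sinv = np.linalg.inv(Sigma)
    lhs = sum(X[i].T @ Sinv @ X[i] for i in range(n))   # sum_i X_i^T Sigma^{-1} X_i
    rhs = sum(X[i].T @ Sinv @ Y[i] for i in range(n))   # sum_i X_i^T Sigma^{-1} y_i
    beta_new = np.linalg.solve(lhs, rhs)
    resid = Y - X @ beta_new                            # row i is y_i - X_i beta
    Sigma = resid.T @ resid / n                         # (1/n) sum_i r_i r_i^T
    if np.linalg.norm(beta_new - beta) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print(beta)    # should land close to beta_true
print(Sigma)   # should land close to Sigma_true
```

With a few hundred units the iteration typically settles within a handful of passes, which is the behavior the update rules above suggest.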

The model as written in matrix form by unit \(i\) is: \[ \begin{aligned} y_{i} \mid X_{i} \, & = X_{i} \beta + \epsilon_{i} \\ \epsilon_{i} & \sim \text{Normal}(0, \Sigma) \\ \epsilon_{i} & \indy \epsilon_{j} \forall i\neq j \end{aligned} \] Let \(y = (y_1^T, y_2^T, \dots, y_n^T)^T\) and let \(X = (X_1^T, X_2^T, \dots, X_n^T)^T\), and let \(\epsilon = (\epsilon_1^T, \epsilon_2^T, \dots, \epsilon_n^T)^T\). Then the model can be written: \[ \begin{aligned} y \mid X \, & = X \beta + \epsilon \\ \epsilon & \sim \text{Normal}(0, I_n \otimes \Sigma) \end{aligned} \] so \(\text{Cov}(\epsilon)\) is block-diagonal: \[ \begin{bmatrix} \Sigma & 0 & \dots & 0 \\ 0 & \Sigma & \dots & 0 \\ 0 & 0 & \ddots & 0 \\ 0 & 0 & \dots & \Sigma \end{bmatrix} \] The log-likelihood is:

\[ \ell_{Y}(\beta, \Sigma \mid y) = -\frac{1}{2} \log \det(I_n \otimes \Sigma) -\frac{1}{2}(y - X \beta)^T (I_n \otimes \Sigma)^{-1}(y - X \beta) \] The determinant of \(I_n \otimes \Sigma\) is \(\det(\Sigma)^n\) because it’s just block-diagonal, and the inverse of \(I_n \otimes \Sigma\) is similarly \(I_n \otimes \Sigma^{-1}\).
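Both Kronecker facts are easy to verify numerically for small \(n\) and \(K\); a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 3, 2
A = rng.normal(size=(K, K))
Sigma = A @ A.T + K * np.eye(K)

big = np.kron(np.eye(n), Sigma)   # I_n kron Sigma, block diagonal

# det(I_n kron Sigma) = det(Sigma)^n
print(np.allclose(np.linalg.det(big), np.linalg.det(Sigma) ** n))

# (I_n kron Sigma)^{-1} = I_n kron Sigma^{-1}
print(np.allclose(np.linalg.inv(big), np.kron(np.eye(n), np.linalg.inv(Sigma))))
```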

Let’s focus on the \(\beta\) terms. Expanding the quadratic form gives: \[ -\frac{1}{2}y^T(I_n \otimes \Sigma)^{-1}y + y^T (I_n \otimes \Sigma)^{-1}X \beta -\frac{1}{2} \beta^T X^T (I_n \otimes \Sigma)^{-1}X \beta \]

Taking the differential with respect to \(\beta\) gives: \[ y^T (I_n \otimes \Sigma)^{-1}X \mathrm{d}\beta -\frac{1}{2} \mathrm{d} \beta^T X^T (I_n \otimes \Sigma)^{-1}X \beta -\frac{1}{2} \beta^T X^T (I_n \otimes \Sigma)^{-1}X \mathrm{d}\beta \] Collecting terms gives: \[ (y^T (I_n \otimes \Sigma^{-1})X - \beta^T X^T (I_n \otimes \Sigma)^{-1}X) \mathrm{d}\beta \] This looks more daunting than it is: block matrix multiplication gives \[ \lp\textstyle\sum_i y_i^T \Sigma^{-1} X_i - \beta^T \sum_i X_i^T \Sigma^{-1} X_i \rp \mathrm{d}\beta \] If \(\Sigma\) were known, we could set this to zero and solve for \(\beta\):

\[ \hat{\beta} = \textstyle\lp\sum_i X_i^T \Sigma^{-1} X_i\rp^{-1} \lp\sum_i X_i^T \Sigma^{-1} y_i\rp \]

Like we did above, we can rewrite the likelihood in terms of \(\Sigma^{-1}\) to give:

\[ \begin{aligned} \ell_{Y}(\beta, \Sigma \mid y) & = \frac{n}{2} \log \det(\Sigma^{-1}) - \frac{1}{2}\sum_i(y_i - X_i \beta)^T\Sigma^{-1}(y_i - X_i \beta) \\ & = \frac{n}{2} \log \det(\Sigma^{-1}) -\frac{1}{2}\sum_i \text{tr}\lp(y_i - X_i \beta)^T\Sigma^{-1}(y_i - X_i \beta)\rp \\ & = \frac{n}{2} \log \det(\Sigma^{-1}) -\frac{1}{2}\sum_i \text{tr}\lp(y_i - X_i \beta)(y_i - X_i \beta)^T\Sigma^{-1}\rp \\ & = \frac{n}{2} \log \det(\Sigma^{-1}) -\frac{1}{2}\text{tr}\lp\lp \sum_i (y_i - X_i \beta)(y_i - X_i \beta)^T\rp\Sigma^{-1}\rp \\ \end{aligned} \] Taking the differential with respect to \(\Sigma^{-1}\) gives:

\[ \begin{aligned} \mathrm{d}\, \ell_{Y}(\beta, \Sigma \mid y) & = \frac{n}{2} \text{tr}\lp\Sigma\,\mathrm{d} \Sigma^{-1}\rp -\frac{1}{2}\text{tr}\lp\sum_i (y_i - X_i \beta) (y_i - X_i \beta)^T \mathrm{d} \Sigma^{-1}\rp \\ & = \text{tr}\lp\lp \frac{n}{2} \Sigma -\frac{1}{2}\sum_i (y_i - X_i \beta) (y_i - X_i \beta)^T\rp \mathrm{d} \Sigma^{-1}\rp \end{aligned} \] Both of these derivatives have to be zero at the maximum likelihood estimate (assuming we’re not on a boundary of the parameter space), so we get two sets of equations that we can iterate between:

\[ \Sigma^{(t+1)} = \frac{1}{n} \sum_i (y_i - X_i \beta^{(t+1)}) (y_i - X_i \beta^{(t+1)})^T \]

\[ \beta^{(t+1)} = \textstyle\lp\sum_i X_i^T (\Sigma^{(t)})^{-1} X_i\rp^{-1} \lp\sum_i X_i^T (\Sigma^{(t)})^{-1} y_i\rp \]

References

Magnus, Jan R, and Heinz Neudecker. 2019. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons.