REML thoughts

1 REML derivation three ways

This derivation follows Smyth and Verbyla (1996) and Harville (1974).

Let \(Y \in \R^n\), \(X \in \R^{n \times p}\) with rank \(p\), and let \(Y \sim \text{Normal}(X \beta, \Sigma(\gamma))\), \(\gamma \in \R^m\), with density: \[ f_Y(y \mid \beta, \gamma) = (2\pi)^{-n / 2}\det \Sigma(\gamma)^{-1/2}\exp\lp -\frac{1}{2} \lp y - X \beta \rp^T \Sigma(\gamma)^{-1}\lp y - X \beta \rp \rp \] We’ll show that \[ \lp y - X \beta \rp^T \Sigma(\gamma)^{-1}\lp y - X \beta \rp = (\hat{\beta} - \beta)^T X^T \Sigma(\gamma)^{-1} X(\hat{\beta} - \beta) + (K^Ty)^T (K^T \Sigma(\gamma) K)^{-1} K^Ty \]

where, for any \(C \in \R^{(n - p) \times n}\) with full row rank chosen such that \(K^T K = I_{n-p}\), \[ \hat{\beta} = (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T\Sigma(\gamma)^{-1} y, \quad K^T = C(I_n - X(X^T X)^{-1} X^T) \] Let’s work on the log scale: \[ \log f_Y(y \mid \beta, \gamma) = -\frac{n}{2} \log (2\pi) - \frac{1}{2} \log \det \Sigma(\gamma) -\frac{1}{2} \lp y - X \beta \rp^T \Sigma(\gamma)^{-1}\lp y - X \beta \rp \] Let \(L^T L\) be a decomposition of \(\Sigma(\gamma)^{-1}\) (e.g., a Cholesky factorization), so we can rewrite the quadratic term: \[\begin{align} \lp y - X \beta \rp^T \Sigma(\gamma)^{-1}\lp y - X \beta \rp & = (L \lp y - X \beta \rp)^T L \lp y - X \beta \rp\\ & = \lp L y - LX \beta \rp^T \lp L y - L X \beta \rp \end{align}\] Then we introduce the following variables: \(u = Ly\) and \(W = LX\), so the quadratic becomes: \[ (u - W \beta)^T (u - W \beta) \] We can project \(u\) onto the column space of \(W\) via the usual least-squares estimator: \[ W \hat{\beta} = W (W^T W)^{-1} W^T u = L X(X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} y \] Then we can rewrite the quadratic by adding and subtracting \(W \hat{\beta}\): \[ (W \hat{\beta} + u - W \hat{\beta} - W \beta)^T(W \hat{\beta} + u - W \hat{\beta} - W \beta) \] This expands to: \[ (W \hat{\beta} - W \beta)^T(W \hat{\beta} - W \beta) + (u - W\hat{\beta})^T (u - W\hat{\beta}) + 2(W \hat{\beta} - W \beta)^T(u - W\hat{\beta}) \] Note that \((W \hat{\beta} - W \beta)^T(u - W\hat{\beta}) = 0\), as the residuals of the regression of \(u\) on \(W\) are orthogonal to the column space of \(W\). We now have the following expression for the quadratic term: \[\begin{align} (W \hat{\beta} - W \beta)^T(W \hat{\beta} - W \beta) & + (u - W\hat{\beta})^T (u - W\hat{\beta}) \\ & = (L X \hat{\beta} - L X \beta)^T(L X \hat{\beta} - L X \beta) + (L y - L X\hat{\beta})^T (L y - L X\hat{\beta}) \\ & = (\hat{\beta} - \beta)^TX^T\Sigma(\gamma)^{-1} X (\hat{\beta} - \beta) + (y - X\hat{\beta})^T \Sigma(\gamma)^{-1} (y - X \hat{\beta}) \end{align}\] The expression \((y - X\hat{\beta})^T \Sigma(\gamma)^{-1} (y - X \hat{\beta})\) simplifies: \[\begin{align} & (y - X\hat{\beta})^T \Sigma(\gamma)^{-1} (y - X \hat{\beta}) \\ & = (y - X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} y)^T \Sigma(\gamma)^{-1} (y - X(X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} y)\\ & = y^T(I - X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1})^T \Sigma(\gamma)^{-1} (I - X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} )y \end{align}\] The matrix in the quadratic form simplifies again: \[\begin{align} & (I - X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1})^T \Sigma(\gamma)^{-1} (I - X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} )\\ & =(I - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T ) (\Sigma(\gamma)^{-1} - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} ) \\ & = \Sigma(\gamma)^{-1} - 2\Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} \\ & \quad +\Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} \\ & = \Sigma(\gamma)^{-1} - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} \end{align}\]

Finally we have the full expression for the quadratic form: \[ (\hat{\beta} - \beta)^TX^T\Sigma(\gamma)^{-1} X (\hat{\beta} - \beta) + y^T \lp\Sigma(\gamma)^{-1} - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1}\rp y \]
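As a sanity check, here is a minimal numerical sketch (NumPy, with a random \(X\), \(y\), \(\beta\), and an arbitrary positive-definite \(\Sigma\); all names are illustrative) verifying the quadratic-form decomposition above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)      # an arbitrary positive-definite covariance
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

beta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)   # the GLS estimator
M = Si - Si @ X @ np.linalg.solve(X.T @ Si @ X, X.T @ Si)
lhs = (y - X @ beta) @ Si @ (y - X @ beta)
rhs = (beta_hat - beta) @ (X.T @ Si @ X) @ (beta_hat - beta) + y @ M @ y
assert np.isclose(lhs, rhs)
```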

The matrix \(\lp\Sigma(\gamma)^{-1} - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1}\rp\) has a simplification, shown in § M.4f of McCulloch, Searle, and Neuhaus (2008):

Proposition 1 (§ M.4f of McCulloch, Searle, and Neuhaus (2008)) For any \(K \in \R^{n \times (n-p)}\) with full column rank such that \(K^T X = 0\), \[ K(K^T \Sigma(\gamma) K)^{-1}K^T = \Sigma(\gamma)^{-1} - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} \]
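Before applying the proposition, here is a quick numerical check (a sketch: \(K\) is built as an orthonormal basis for the null space of \(X^T\), one valid choice satisfying \(K^T X = 0\) and \(K^T K = I\)):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Si = np.linalg.inv(Sigma)

K = null_space(X.T)          # n x (n-p), orthonormal columns, K.T @ X == 0
lhs = K @ np.linalg.solve(K.T @ Sigma @ K, K.T)
rhs = Si - Si @ X @ np.linalg.solve(X.T @ Si @ X, X.T @ Si)
assert np.allclose(lhs, rhs)
```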

As we’ve defined \(K^T = C(I_n - X(X^T X)^{-1} X^T)\), we have \(K^T X = 0\), so the proposition applies and we’re left with the following expression for the quadratic form:

\[ (\hat{\beta} - \beta)^TX^T\Sigma(\gamma)^{-1} X (\hat{\beta} - \beta) + (K^Ty)^T \lp K^T \Sigma K\rp^{-1} K^T y \] We can’t use this directly as a density for the transformed variables because we’ve transformed \(y\) to the statistics \(t\) and \(a\): \[ t = (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} y, \quad a = K^T y \tag{1}\] We have the likelihood expressed in terms of \(t\) and \(a\) already, but we need to account for the Jacobian. Let \(B = \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1}\), so that \(t = B^T y\):

\[ \begin{bmatrix} t \\ a \end{bmatrix} = \begin{bmatrix} B^T \\ K^T \end{bmatrix}y \] We need the determinant of the Jacobian of \((t,a) \to y\), which we can get by taking the reciprocal of the determinant of the Jacobian of \(y \to (t,a)\), which is \[ \left\lvert \det \begin{bmatrix} B^T \\ K^T \end{bmatrix} \right\rvert \] We can simplify this expression by noting that \[ \begin{aligned} \left\lvert \det \begin{bmatrix} B^T \\ K^T \end{bmatrix} \right\rvert & = \sqrt{\det \lp \begin{bmatrix} B^T \\ K^T \end{bmatrix}\begin{bmatrix} B^T \\ K^T \end{bmatrix}^T \rp} \\ & = \det \begin{bmatrix} B^T B & B^TK\\ K^T B & K^T K \end{bmatrix}^{1/2} \end{aligned} \] Using the block matrix determinant formula, we get that: \[ \begin{aligned} \det \begin{bmatrix} B^T B & B^TK\\ K^T B & K^T K \end{bmatrix}^{1/2} & = \det (K^T K)^{1/2} \det (B^T B - B^T K(K^T K)^{-1} K^T B)^{1/2} \\ & = \det (K^T K)^{1/2} \det (B^T(I - K(K^T K)^{-1} K^T)B)^{1/2} \\ & = \det (K^T K)^{1/2} \det (B^T X(X^TX)^{-1}X^T B)^{1/2} \\ & = \det (K^T K)^{1/2} \det ((X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1} X(X^TX)^{-1}X^T \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1})^{1/2} \\ & = \det (K^T K)^{1/2} \det ((X^TX)^{-1})^{1/2} \\ & = \det ((X^TX)^{-1})^{1/2} \end{aligned} \] where the third line uses that \(I - K(K^TK)^{-1}K^T\) is the projection onto the column space of \(X\), and the last line uses \(K^T K = I_{n-p}\). Thus the Jacobian of our transformation is \(\lp\det ((X^TX)^{-1})^{1/2}\rp^{-1} = \sqrt{\det X^T X}\).
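A numerical sketch of the Jacobian calculation (again with \(K\) taken from the null space of \(X^T\); the sign of the determinant is absorbed by the absolute value):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Si = np.linalg.inv(Sigma)

B = Si @ X @ np.linalg.inv(X.T @ Si @ X)     # t = B.T @ y
K = null_space(X.T)                          # a = K.T @ y
J = np.vstack([B.T, K.T])                    # the map y -> (t, a)
assert np.isclose(abs(np.linalg.det(J)),
                  np.linalg.det(X.T @ X) ** -0.5)
```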

1.1 TL;DR: Rewriting the joint distribution

Finally we can write the joint density for \(y\): \[ \begin{aligned} f_Y(y \mid \beta, \gamma) & = (2\pi)^{-n / 2}\det \Sigma(\gamma)^{-1/2}\exp\lp -\frac{1}{2} \lp y - X \beta \rp^T \Sigma(\gamma)^{-1}\lp y - X \beta \rp \rp \end{aligned} \]

as a joint density in \((t,a)\), the one-to-one transformation of \(y\) defined in Equation 1, with \(K\) as in Proposition 1:

\[ \begin{aligned} f_{t,a}(t,a \mid \beta, \gamma) & = \det(X^T X)^{1/2}(2\pi)^{-n / 2}\det \Sigma(\gamma)^{-1/2}\exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp\\ & \quad \times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1} a \rp \end{aligned} \tag{2}\]
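A quick check of Equation 2 (an illustrative NumPy sketch): the log joint density of \((t,a)\) should equal the log density of \(y\) plus the log Jacobian \(\frac{1}{2}\log\det X^TX\).

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
ld = lambda M: np.linalg.slogdet(M)[1]       # log-determinant helper

K = null_space(X.T)
t = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
a = K.T @ y

log_f_y = -0.5 * (n * np.log(2 * np.pi) + ld(Sigma)
                  + (y - X @ beta) @ Si @ (y - X @ beta))
log_f_ta = (0.5 * ld(X.T @ X) - 0.5 * n * np.log(2 * np.pi) - 0.5 * ld(Sigma)
            - 0.5 * (t - beta) @ (X.T @ Si @ X) @ (t - beta)
            - 0.5 * a @ np.linalg.solve(K.T @ Sigma @ K, a))
assert np.isclose(log_f_ta, log_f_y + 0.5 * ld(X.T @ X))
```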

1.2 REML by marginalizing \(t\)

We see that the joint density for \((t,a)\) factorizes. Thus we can derive the marginal distribution for \(a\) by integrating out \(t\):

\[ \begin{aligned} \det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} (2\pi)^{p / 2} = \int_{\R^p} \exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp dt \end{aligned} \] which leaves the following marginal distribution for \(a\):

\[ \begin{aligned} f_{a}(a \mid \gamma) & = \det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} \det(X^T X)^{1/2}(2\pi)^{-(n-p) / 2}\det \Sigma(\gamma)^{-1/2}\\ & \quad \times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \rp \end{aligned} \] Note that by the equivalence in Proposition 1 (McCulloch, Searle, and Neuhaus 2008), the expression can also be written in terms of \(y\):

\[ \begin{aligned} f_{a}(a \mid \gamma) & = \det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} \det(X^T X)^{1/2}(2\pi)^{-(n-p) / 2}\det \Sigma(\gamma)^{-1/2}\\ & \quad \times\exp\lp -\frac{1}{2} y^T \lp\Sigma(\gamma)^{-1} - \Sigma(\gamma)^{-1} X (X^T \Sigma(\gamma)^{-1} X)^{-1} X^T \Sigma(\gamma)^{-1}\rp y \rp \end{aligned} \tag{3}\]
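Equation 3 is a proper density: since \(a = K^T y \sim \text{Normal}(0, K^T \Sigma(\gamma) K)\), its normalizing constant must agree with \((2\pi)^{-(n-p)/2}\det(K^T\Sigma(\gamma)K)^{-1/2}\), which amounts to the determinant identity \(\det(K^T\Sigma(\gamma)K) = \det\Sigma(\gamma)\det(X^T\Sigma(\gamma)^{-1}X)\det(X^TX)^{-1}\) for \(K^TK = I_{n-p}\), \(K^TX = 0\). A numerical sketch of this identity (it reappears in the conditioning derivation below):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(4)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Si = np.linalg.inv(Sigma)

K = null_space(X.T)                          # K.T @ X == 0, K.T @ K == I
lhs = np.linalg.det(K.T @ Sigma @ K)
rhs = (np.linalg.det(Sigma) * np.linalg.det(X.T @ Si @ X)
       / np.linalg.det(X.T @ X))
assert np.isclose(lhs, rhs)
```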

1.3 REML by conditioning on \(t\)

We just showed that the marginal density for \(a\) is of the form:

\[ \begin{aligned} f_{a}(a \mid \gamma) & = \det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} \det(X^T X)^{1/2}(2\pi)^{-(n-p) / 2}\det \Sigma(\gamma)^{-1/2}\\ & \quad \times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \rp \end{aligned} \]

Because the joint density factorizes, we can likewise marginalize out \(a\) from the joint distribution to get the marginal distribution for \(t\):

\[ \begin{aligned} \int_{\R^{n-p}} f_{t,a}(t,a \mid \beta, \gamma)da & = \det(X^T \Sigma(\gamma)^{-1}X)^{1/2}(2\pi)^{-p / 2}\exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp \end{aligned} \] where the constants collapse via the determinant identity \(\det(K^T \Sigma(\gamma) K) = \det\Sigma(\gamma)\det(X^T\Sigma(\gamma)^{-1}X)\det(X^TX)^{-1}\) checked numerically after Equation 3. This shows that \(f_t(t \mid \beta, \gamma)\) is multivariate normal. We can then condition on \(T = t\):

\[ \begin{aligned} \frac{f_{t,a}(t,a \mid \beta, \gamma)}{f_{t}(t \mid \beta, \gamma)} & = (2\pi)^{-(n-p) / 2}\det(X^T X)^{1/2}\det(X^T \Sigma(\gamma)^{-1}X)^{-1/2}\det \Sigma(\gamma)^{-1/2}\\ & \quad \times\exp\lp -\frac{1}{2} \lp K^T y \rp^T (K^T \Sigma(\gamma)K )^{-1}\lp K^T y \rp \rp \end{aligned} \] again giving the REML likelihood. I’ve left the \(y\) in the resulting likelihood because it emphasizes the conditioning of \(y\) on \(t\).
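A numerical sketch of this conditioning route (names illustrative): the ratio of the joint density (Equation 2) to the marginal of \(t\) reproduces the REML density of \(a\) above.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(5)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
ld = lambda M: np.linalg.slogdet(M)[1]

K = null_space(X.T)
t = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
a = K.T @ y
quad_t = (t - beta) @ (X.T @ Si @ X) @ (t - beta)
quad_a = a @ np.linalg.solve(K.T @ Sigma @ K, a)

log_f_ta = (0.5 * ld(X.T @ X) - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * ld(Sigma) - 0.5 * quad_t - 0.5 * quad_a)
log_f_t = (0.5 * ld(X.T @ Si @ X) - 0.5 * p * np.log(2 * np.pi) - 0.5 * quad_t)
log_f_a = (-0.5 * ld(X.T @ Si @ X) + 0.5 * ld(X.T @ X)
           - 0.5 * (n - p) * np.log(2 * np.pi) - 0.5 * ld(Sigma) - 0.5 * quad_a)
assert np.isclose(log_f_ta - log_f_t, log_f_a)
```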

1.4 REML by Bayes (integrating out \(\beta\))

This is the development in Harville (1974).

Suppose we wish to do Bayesian inference on \(\gamma\), so our target is the marginal posterior \(\pi(\gamma \mid y) = \int_{\R^p} \pi(\gamma, \beta \mid y)d\beta\). Suppose our joint prior for \((\gamma, \beta)\) is \(\pi(\gamma, \beta) \propto g(\gamma)\), i.e., flat in \(\beta\). Since \((t,a)\) is an invertible transformation of \(y\), we can work with the joint posterior given \((t,a)\):

\[ \begin{aligned} \pi(\beta, \gamma\mid t,a) & \propto \det(X^T X)^{1/2}(2\pi)^{-n / 2}\det \Sigma(\gamma)^{-1/2}\exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp\\ & \quad \times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1} a \rp g(\gamma) \end{aligned} \]

This factorizes into a conditional distribution for \(\beta \mid \gamma\) and a marginal distribution for \(\gamma\):

\[ \begin{aligned} \pi(\beta, \gamma\mid t,a) & \propto \pi(\beta \mid \gamma, t) \pi(\gamma \mid a) \\ & \propto \det(X^T X)^{1/2}(2\pi)^{-n / 2} \\ & \quad \times \exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp\\ & \quad\times \det \Sigma(\gamma)^{-1/2}\times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \rp g(\gamma) \end{aligned} \] We recognize \(\pi(\beta \mid \gamma, t)\) as a multivariate Gaussian: \[ \pi(\beta \mid \gamma, t) \propto \exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp \]

so we multiply and divide by the normalizing constant \(\det(X^T \Sigma(\gamma)^{-1}X)^{1/2} (2\pi)^{-p / 2}\) to get a proper density in \(\beta\), which we can then integrate out:

\[ \begin{aligned} \pi(\beta, \gamma\mid t,a) & \propto \pi(\beta \mid \gamma, t) \pi(\gamma \mid a) \\ & \propto \det(X^T X)^{1/2}(2\pi)^{-n / 2} \\ & \quad \times \det(X^T \Sigma(\gamma)^{-1}X)^{1/2} (2\pi)^{-p / 2}\exp\lp -\frac{1}{2} \lp t - \beta \rp^T X^T \Sigma(\gamma)^{-1}X\lp t - \beta \rp \rp\\ & \quad \times \det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} (2\pi)^{p / 2}\det \Sigma(\gamma)^{-1/2}\times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \rp g(\gamma) \end{aligned} \]

Integrating the proper density in \(\beta\) over \(\R^p\) yields \(1\), leaving the marginal posterior for \(\gamma\):

\[ \begin{aligned} \pi(\gamma\mid a) & \propto \det(X^T X)^{1/2}(2\pi)^{-(n-p) / 2}\det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} \\ & \quad \times \det \Sigma(\gamma)^{-1/2}\times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \rp g(\gamma) \end{aligned} \] Since \(g(\gamma)\) is the prior, what remains must be the REML likelihood:

\[ \begin{aligned} f_a(a \mid \gamma) & \propto \det(X^T X)^{1/2}(2\pi)^{-(n-p) / 2}\det(X^T \Sigma(\gamma)^{-1}X)^{-1/2} \\ & \quad \times \det \Sigma(\gamma)^{-1/2}\times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \rp \end{aligned} \]
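As a toy check of the Bayes route (a sketch, not Harville's construction verbatim): take \(p = 1\), a flat prior on \(\beta\), and the illustrative one-parameter family \(\Sigma(\gamma) = \gamma I_n\). Numerically integrating \(\beta\) out of the likelihood should match the REML likelihood up to a factor constant in \(\gamma\):

```python
import numpy as np
from scipy.integrate import quad
from scipy.linalg import null_space

rng = np.random.default_rng(6)
n, p = 8, 1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
K = null_space(X.T)                  # K.T @ X == 0, K.T @ K == I
a = K.T @ y

def f_y(beta, gamma):
    # Normal(X beta, gamma * I) density of y
    r = y - X[:, 0] * beta
    return np.exp(-0.5 * (n * np.log(2 * np.pi * gamma) + r @ r / gamma))

def f_a(gamma):
    # REML density of a; here K.T @ Sigma(gamma) @ K = gamma * I
    return ((2 * np.pi * gamma) ** (-(n - p) / 2)
            * np.exp(-0.5 * a @ a / gamma))

gammas = np.linspace(0.5, 3.0, 6)
marg = np.array([quad(f_y, -np.inf, np.inf, args=(g,))[0] for g in gammas])
ratio = marg / np.array([f_a(g) for g in gammas])
assert np.allclose(ratio, ratio[0])  # constant in gamma, as the derivation says
```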

1.5 Evaluation of joint likelihood at REML estimates

If we evaluate Equation 2 at the REML estimates \((\hat{\beta}, \hat{\gamma})\), where \(\hat{\gamma}\) maximizes Equation 3 and \(\hat{\beta} \equiv t(\hat{\gamma})\) is obtained by plugging \(\hat{\gamma}\) into the expression for \(t\) shown in Equation 1, this gives:

\[ \begin{aligned} f_{t,a}(t,a \mid \hat{\beta}, \hat{\gamma}) & = \det(X^T X)^{1/2}(2\pi)^{-n / 2}\det \Sigma(\hat{\gamma})^{-1/2}\exp\lp -\frac{1}{2} \lp t(\hat{\gamma}) - \hat{\beta} \rp^T X^T \Sigma(\hat{\gamma})^{-1}X\lp t(\hat{\gamma}) - \hat{\beta} \rp \rp\\ & \quad \times\exp\lp -\frac{1}{2} a^T (K^T \Sigma(\hat{\gamma})K )^{-1}a \rp \\ & = \det(X^T X)^{1/2}(2\pi)^{-n / 2}\det \Sigma(\hat{\gamma})^{-1/2} \exp\lp -\frac{1}{2} a^T (K^T \Sigma(\hat{\gamma})K )^{-1} a \rp \end{aligned} \]

The log likelihood is:

\[ \begin{aligned} \ell(\hat{\beta}, \hat{\gamma} \mid t, a) & = \frac{1}{2}\log\det(X^T X) -\frac{n}{2} \log(2\pi) -\frac{1}{2}\log\det \Sigma(\hat{\gamma}) -\frac{1}{2} a^T (K^T \Sigma(\hat{\gamma})K )^{-1} a \end{aligned} \]

Thus, if we have maximized the REML criterion shown in Equation 3, giving the maximized objective function:

\[ \begin{aligned} \ell(\hat{\gamma}; a) & = -\frac{1}{2}\log\det(X^T \Sigma(\hat{\gamma})^{-1}X) + \frac{1}{2} \log\det(X^T X) -\frac{n-p}{2} \log(2\pi) - \frac{1}{2}\log\det \Sigma(\hat{\gamma})\\ & \quad -\frac{1}{2} a^T (K^T \Sigma(\hat{\gamma})K )^{-1} a \end{aligned} \]

and we wish to evaluate, at the REML estimates, the joint log-likelihood for \((t,a)\) (which differs from the log-likelihood of \(y\) only by the constant Jacobian term \(\frac{1}{2}\log\det X^T X\)), we can adjust the REML criterion to do so:

\[ \ell(\hat{\beta}, \hat{\gamma} \mid t, a) = \ell(\hat{\gamma}; a) -\frac{p}{2}\log(2\pi) +\frac{1}{2}\log\det(X^T \Sigma(\hat{\gamma})^{-1}X) \]
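Note the adjustment is an algebraic identity in \(\gamma\), so it holds at any \(\gamma\), in particular at \(\hat{\gamma}\). A numerical sketch (with an arbitrary positive-definite matrix standing in for \(\Sigma(\hat{\gamma})\)):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(7)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)      # stand-in for Sigma(gamma-hat)
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)
ld = lambda M: np.linalg.slogdet(M)[1]

K = null_space(X.T)
a = K.T @ y
quad_a = a @ np.linalg.solve(K.T @ Sigma @ K, a)

# joint log-likelihood of (t, a) at beta-hat = t (the first quadratic vanishes)
ell_joint = (0.5 * ld(X.T @ X) - 0.5 * n * np.log(2 * np.pi)
             - 0.5 * ld(Sigma) - 0.5 * quad_a)
# REML criterion (Equation 3) plus the adjustment above
ell_reml = (-0.5 * ld(X.T @ Si @ X) + 0.5 * ld(X.T @ X)
            - 0.5 * (n - p) * np.log(2 * np.pi)
            - 0.5 * ld(Sigma) - 0.5 * quad_a)
assert np.isclose(ell_joint,
                  ell_reml - 0.5 * p * np.log(2 * np.pi) + 0.5 * ld(X.T @ Si @ X))
```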

1.6 Use of REML criterion in model comparison

When we have models with differing fixed effects, we’d like to use the modified REML criterion to compare models. Verbyla (2019) argues that we should compare the joint log-likelihood for \((t,a)\) evaluated at the REML estimates of \((\beta, \gamma)\) under each model.

We can see that each model’s log-likelihood contains the same \(-\frac{n}{2}\log(2\pi)\) term, so this term can be omitted. The \(\frac{1}{2}\log\det X^T X\) term, however, cannot be omitted, because it depends on the fixed-effects design matrix and so differs across models.

The REML criterion used in the spmodel R package leaves out \(\frac{1}{2}\log\det X^TX\):

\[ \begin{aligned} \ell_{\texttt{spmodel}}^{\text{REML}}(\gamma \mid a) & = -\frac{n-p}{2}\log(2\pi) -\frac{1}{2}\log\det(X^T \Sigma(\gamma)^{-1}X) -\frac{1}{2}\log\det\Sigma(\gamma) -\frac{1}{2} a^T (K^T \Sigma(\gamma)K )^{-1}a \end{aligned} \]

In order to use \(\ell_{\texttt{spmodel}}^{\text{REML}}(\hat{\gamma} \mid a)\) in model comparison, we need to add back \(\frac{1}{2}\log\det X^T X\), cancel the \(-\frac{1}{2}\log\det(X^T \Sigma(\gamma)^{-1} X)\) term, and subtract \(\frac{p}{2}\log(2\pi)\).

This would yield a log-likelihood suitable for use in an AIC:

\[ \begin{aligned} \ell^{\text{REML}}(\hat{\beta},\hat{\gamma} \mid a, t) & = \ell_{\texttt{spmodel}}^{\text{REML}}(\hat{\gamma} \mid a) -\frac{p}{2}\log(2\pi) +\frac{1}{2}\log\det(X^T \Sigma(\hat{\gamma})^{-1}X) + \frac{1}{2}\log\det X^T X \end{aligned} \]
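A sketch of this bookkeeping as a helper function (hypothetical; it takes an spmodel-style criterion value computed from the formula above rather than calling spmodel itself):

```python
import numpy as np
from scipy.linalg import null_space

def reml_to_joint_loglik(ell_spmodel, X, Sigma):
    """Convert an spmodel-style REML criterion value to the joint (t, a)
    log-likelihood suitable for AIC-style comparison (hypothetical helper)."""
    p = X.shape[1]
    Si = np.linalg.inv(Sigma)
    ld = lambda M: np.linalg.slogdet(M)[1]
    return (ell_spmodel - 0.5 * p * np.log(2 * np.pi)
            + 0.5 * ld(X.T @ Si @ X) + 0.5 * ld(X.T @ X))

# check against the directly computed joint log-likelihood at beta-hat = t
rng = np.random.default_rng(8)
n, p = 8, 3
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)
Si = np.linalg.inv(Sigma)
y = rng.standard_normal(n)
K = null_space(X.T)
a = K.T @ y
ld = lambda M: np.linalg.slogdet(M)[1]
quad_a = a @ np.linalg.solve(K.T @ Sigma @ K, a)

ell_spmodel = (-0.5 * (n - p) * np.log(2 * np.pi) - 0.5 * ld(X.T @ Si @ X)
               - 0.5 * ld(Sigma) - 0.5 * quad_a)
ell_joint = (0.5 * ld(X.T @ X) - 0.5 * n * np.log(2 * np.pi)
             - 0.5 * ld(Sigma) - 0.5 * quad_a)
assert np.isclose(reml_to_joint_loglik(ell_spmodel, X, Sigma), ell_joint)
```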

References

Harville, David A. 1974. “Bayesian Inference for Variance Components Using Only Error Contrasts.” Biometrika 61 (2). https://doi.org/10.2307/2334370.
McCulloch, Charles E., Shayle R. Searle, and John M. Neuhaus. 2008. Generalized, Linear, and Mixed Models. 2nd ed. John Wiley & Sons.
Smyth, Gordon K., and Arūnas P. Verbyla. 1996. “A Conditional Likelihood Approach to Residual Maximum Likelihood Estimation in Generalized Linear Models.” Journal of the Royal Statistical Society. Series B (Methodological) 58 (3).
Verbyla, Arunas Petras. 2019. “A Note on Model Selection Using Information Criteria for General Linear Models Estimated Using REML.” Australian & New Zealand Journal of Statistics 61 (1): 39–50. https://doi.org/10.1111/anzs.12254.