Multivariate normal models

Rob Trangucci

Multivariate normal distribution

  • Important to understand how to work with the multivariate normal (MVN) distribution
  • Serves as a useful building block in many models
  • Linear regression, inference on several normal means, time series models, Kalman filters, and Bayesian nonparametric regression can all be formulated as problems in multivariate normal inference
  • We can’t always carry out this inference analytically when covariance parameters are unknown, but we’ll still need to be handy with the various ways we can decompose a multivariate normal distribution

Characterizing the MVN

Let \(y\) be a \(d\)-dimensional random variable, written in sampling notation

\[ y \mid \mu, \Sigma \sim \text{MVN}(\mu, \Sigma) \]

The density of \(y\) is

\[ p(y\mid \mu, \Sigma) \propto \det\lp \Sigma \rp^{-1/2}\exp\lp -\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu) \rp \]

Parameters: \(\Exp{y \mid \mu, \Sigma} = \mu\) and \(\Exp{(y - \mu)(y - \mu)^T \mid \mu, \Sigma} = \Sigma\), where \(\Sigma\) is positive definite

Equivalences between MVNs

  • Let \(I_d\) be the \(d\)-dimensional identity matrix

  • If \(z \sim \text{MVN}(0, I_d)\), then for a \(d \times d\) nonsingular matrix \(A\) and a \(d\)-dimensional vector \(b\), \(y = b + A z\) is distributed as multivariate normal with mean \(b\) and covariance matrix \(AA^T\).

  • Reminder: Positive definite (p.d.) matrices are characterized by the relation \[\alpha^T B \alpha > 0 \quad \forall \alpha \in \R^d, \alpha \neq 0 \]

Cholesky decomposition

  • Let \(\Sigma\) be a \(d \times d\) p.d. matrix

  • Then there exists a lower-triangular matrix \(L\) such that \(\Sigma = LL^T\)

  • In R code L <- t(chol(Sigma))

  • Efficient computation of \[\Sigma^{-1} b = (L L^T)^{-1} b = (L^{-1})^T L^{-1} b, \quad b \in \R^d\]:

    • Compute \(L\), L <- t(chol(Sigma))

    • Compute \(L^{-1}b\), inv_Lb <- forwardsolve(L, b)

    • Compute \((L^{-1})^T L^{-1} b\), inv_Sigma_b <- backsolve(t(L), inv_Lb)

  • Why is this efficient? The Cholesky decomposition has complexity \(d^3 / 3\), and the forward and back solves have complexity \(d^2\)
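
  • A minimal sketch of these three steps in R, checked against a direct solve (the Sigma and b below are arbitrary illustrative values):

set.seed(1)
d <- 4
A <- matrix(rnorm(d * d), d, d)
Sigma <- crossprod(A) + diag(d)          # an arbitrary positive-definite matrix
b <- rnorm(d)

L <- t(chol(Sigma))                      # lower-triangular Cholesky factor
inv_Lb <- forwardsolve(L, b)             # solves L %*% v = b
inv_Sigma_b <- backsolve(t(L), inv_Lb)   # solves t(L) %*% w = v

max(abs(inv_Sigma_b - solve(Sigma, b)))  # should be numerically ~ 0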

Generating draws from a MVN

d <- 3                        # dimension
S <- 1e5                      # number of draws
Sigma <- matrix(
  c(1, 0.9, 0.8,
    0.9, 1, 0.9,
    0.8, 0.9, 1),
  nrow = d, ncol = d)
L <- t(chol(Sigma))           # lower-triangular Cholesky factor
b <- c(1, 2, 3)               # mean vector

z <- matrix(                  # S standard-normal d-vectors, one per column
  rnorm(S * d),
  nrow = d, ncol = S
)

y_mean_zero <- L %*% z        # each column ~ MVN(0, Sigma)
y <- sweep(y_mean_zero,       # add b to each column: y ~ MVN(b, Sigma)
           MARGIN = 1,
           STATS = b,
           FUN = "+")
round(cov(t(y)), 2)           # sample covariance should be close to Sigma
     [,1] [,2] [,3]
[1,]  1.0  0.9  0.8
[2,]  0.9  1.0  0.9
[3,]  0.8  0.9  1.0
round(rowMeans(y), 2)         # sample means should be close to b
[1] 0.99 2.00 3.00

More useful facts

  • \(d\)-dim \(X\) is MVN if and only if \(\alpha^T X\) is univariate normal for all \(\alpha \in \R^d\)

  • For jointly MVN \(X\), \(X_i\) and \(X_j\) are independent if and only if \(\Sigma_{ij} = 0\)

  • Linear transformations of MVNs are MVN:

    • Let \(d\)-dim \(X\) be distributed \(\text{MVN}(\mu, \Sigma)\)

    • Let \(M\) be a \(k \times d\) matrix and let \(b\) be a \(k\)-vector

    • Define \(Y = b + MX\)

    • Then \(Y \sim \text{MVN}(b + M \mu, M\Sigma M^T)\)
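
  • A quick simulation check of this fact (the dimensions, \(M\), \(b\), \(\mu\), and \(\Sigma\) below are arbitrary illustrative choices):

set.seed(2)
d <- 3; k <- 2; S <- 1e5
mu <- c(1, 2, 3)
Sigma <- matrix(c(1, 0.9, 0.8,
                  0.9, 1, 0.9,
                  0.8, 0.9, 1), d, d)
M <- matrix(c(1, 0, -1,
              2, 1,  0), nrow = k, byrow = TRUE)
b <- c(0.5, -0.5)

X <- t(mu + t(chol(Sigma)) %*% matrix(rnorm(S * d), d, S))  # S draws of X, one per row
Y <- t(b + M %*% t(X))                                      # S draws of Y = b + M X

round(colMeans(Y), 2)   # close to b + M %*% mu
round(cov(Y), 2)        # close to M %*% Sigma %*% t(M)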

Partitions of MVN

  • Let \(Y = (Y_1, Y_2)\), with \(Y_1\) \(d_1\)-dimensional and \(Y_2\) \(d_2\)-dimensional, and with the following distribution:

\[ \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \sim \text{MVN}\lp\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \rp \]

  • \(Y_1\) is marginally MVN with mean \(\mu_1\) and covariance \(\Sigma_{11}\)

Can show this by defining \(Y_1 = b + M Y\)

\[ \begin{aligned} b=0 && M = \begin{bmatrix} I_{d_1} & 0_{d_1 \times d_2} \end{bmatrix} \end{aligned} \]

Dist of \(Y_1 \mid Y_2\)

  • Let \(Y_1\) and \(Y_2\) be jointly MVN as in the last slide

  • Let \(X_1 = Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2\)

  • \(X_1 \indy Y_2\) (How can we show this? \(\text{Cov}(X_1, Y_2) = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0\), and for jointly MVN variables zero covariance implies independence)

  • \(\text{Cov}(X_1) = \text{Cov}(Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2, Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2)\)

\[ \begin{aligned} & \text{Cov}(Y_1) - \text{Cov}(\Sigma_{12}\Sigma_{22}^{-1} Y_2,Y_1) - \text{Cov}(Y_1,\Sigma_{12}\Sigma_{22}^{-1} Y_2) + \text{Cov}(\Sigma_{12}\Sigma_{22}^{-1} Y_2)\\ & = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{aligned} \]

  • \(\Exp{X_1} = \mu_1 - \Sigma_{12}\Sigma_{22}^{-1} \mu_2\)

  • Now write \(Y_1 = X_1 + \Sigma_{12}\Sigma_{22}^{-1} Y_2\)

  • Condition \(Y_1\) on \(Y_2 = y\)

  • Note that \(X_1\) doesn’t depend on \(Y_2\) due to \(\indy\), so its distribution is unchanged by conditioning

  • Then \(Y_1\) is just \(X_1\) shifted by \(+\Sigma_{12}\Sigma_{22}^{-1} y\)

\[ Y_1 \mid Y_2 = y \sim \text{MVN}(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2), \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) \]
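
  • A small R sketch of this formula, using the illustrative \(\mu\) and \(\Sigma\) from the earlier example, with \(Y_1\) the first two components and \(Y_2\) the third:

mu <- c(1, 2, 3)
Sigma <- matrix(c(1, 0.9, 0.8,
                  0.9, 1, 0.9,
                  0.8, 0.9, 1), 3, 3)
i1 <- 1:2                                  # indices of Y1
i2 <- 3                                    # indices of Y2
y2 <- 2.5                                  # observed value of Y2 (illustrative)

S12 <- Sigma[i1, i2, drop = FALSE]
S21 <- Sigma[i2, i1, drop = FALSE]
S22 <- Sigma[i2, i2, drop = FALSE]

cond_mean <- mu[i1] + S12 %*% solve(S22, y2 - mu[i2])   # mu_1 + S12 S22^{-1} (y - mu_2)
cond_cov  <- Sigma[i1, i1] - S12 %*% solve(S22, S21)    # S11 - S12 S22^{-1} S21
cond_mean
cond_cov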

Back to Bayes-ics

  • Suppose we have \(y_1 \mid \mu \sim \text{MVN}(\mu, \Sigma)\) with \(\Sigma\) known

  • Furthermore, let \(\mu \sim \text{MVN}(\mu_0, \Lambda_0)\)

  • We can write \(y_1 = \mu + z\) where \(z \sim \text{MVN}(0, \Sigma)\), \(z \indy \mu\) (sort of the converse of what we just proved in the last slide)

  • Then using the fact that \(\text{Cov}(y_1,\mu) = \Lambda_0\), \(\Exp{y_1} = \mu_0\) and \(\text{Cov}(y_1) = \Sigma + \Lambda_0\), we can write

\[ \begin{bmatrix} y_1 \\ \mu \end{bmatrix} \sim \text{MVN}\lp \begin{bmatrix} \mu_0 \\ \mu_0 \end{bmatrix}, \begin{bmatrix} \Sigma + \Lambda_0 & \Lambda_0 \\ \Lambda_0 & \Lambda_0 \end{bmatrix} \rp \]

  • This allows us to use our facts about conditional distributions to get the posterior for \(\mu \mid y_1\):

\[ \begin{aligned} \mu \mid y_1 \sim \text{MVN}(& \mu_0 + \Lambda_0(\Sigma + \Lambda_0)^{-1}(y_1 - \mu_0) ,\\ &\Lambda_0 - \Lambda_0(\Sigma + \Lambda_0)^{-1}\Lambda_0 ) \end{aligned} \]

  • To get this to look more like the book’s results, we can use a matrix identity:

\[ (A + B)^{-1} = A^{-1} - A^{-1}(B^{-1} + A^{-1})^{-1}A^{-1} \]

  • Using this result we get a different form of the covariance matrix:

\[ \Lambda_0 - \Lambda_0(\Sigma + \Lambda_0)^{-1}\Lambda_0 = (\Lambda_0^{-1} + \Sigma^{-1})^{-1} \]

  • We can also rewrite the factor \(\Lambda_0(\Sigma + \Lambda_0)^{-1}\) in the conditional mean by multiplying the identity above on the right by \(\Lambda_0^{-1}\):

\[ \Lambda_0(\Sigma + \Lambda_0)^{-1} = I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1} \]

  • Which yields a conditional mean (writing \(y\) for \(y_1\))

\[ \begin{aligned} & y - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}(y - \mu_0) \\ & = (I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1})y + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \end{aligned} \]

  • We note the following

\[ \begin{aligned} I & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}(\Sigma^{-1} + \Lambda_0^{-1})\\ I & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\\ I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1} & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} \end{aligned} \]

  • Finally yielding

\[ \begin{aligned} & (I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1})y + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \\ & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} y + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \\ & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\lp\Sigma^{-1} y + \Lambda_0^{-1}\mu_0 \rp \end{aligned} \]
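
  • These identities are easy to sanity-check numerically; a sketch with arbitrary positive-definite \(\Sigma\) and \(\Lambda_0\):

set.seed(3)
d <- 3
A1 <- matrix(rnorm(d * d), d, d); Sigma   <- crossprod(A1) + diag(d)
A2 <- matrix(rnorm(d * d), d, d); Lambda0 <- crossprod(A2) + diag(d)
mu0 <- c(1, 0, -1)
y1  <- c(2, 1, 0)

# Form 1: straight from the conditional-MVN result
m1 <- mu0 + Lambda0 %*% solve(Sigma + Lambda0, y1 - mu0)
V1 <- Lambda0 - Lambda0 %*% solve(Sigma + Lambda0, Lambda0)

# Form 2: precision-weighted form derived above
Vn <- solve(solve(Lambda0) + solve(Sigma))
m2 <- Vn %*% (solve(Sigma) %*% y1 + solve(Lambda0) %*% mu0)

max(abs(m1 - m2))   # ~ 0
max(abs(V1 - Vn))   # ~ 0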

With \(n\) observations

  • For \(n\) i.i.d. observations, we can use the fact that \(\bar{y} \mid \mu \sim \text{MVN}(\mu, n^{-1}\Sigma)\) together with the previous results

\[ \begin{aligned} \mu \mid \bar{y} \sim \text{MVN}(&(\Lambda_0^{-1} + n\Sigma^{-1})^{-1}\lp n\Sigma^{-1} \bar{y} + \Lambda_0^{-1}\mu_0 \rp,\\ & (\Lambda_0^{-1} + n\Sigma^{-1})^{-1}) \end{aligned} \]

  • This parallels the univariate normal results (recall):

\[ \mu \mid \bar{y} \sim \text{Normal}\lp\lp \frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp\lp\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp^{-1}, \lp\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp^{-1}\rp \]
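
  • A small R sketch of the \(n\)-observation MVN update above (the \(\Sigma\), \(\Lambda_0\), \(\mu_0\), and simulated data below are illustrative choices):

set.seed(4)
d <- 3; n <- 50
mu_true <- c(1, 2, 3)
Sigma   <- matrix(c(1, 0.9, 0.8,
                    0.9, 1, 0.9,
                    0.8, 0.9, 1), d, d)   # known data covariance
Lambda0 <- diag(d)                        # prior covariance (assumed)
mu0     <- rep(0, d)                      # prior mean (assumed)

y    <- t(mu_true + t(chol(Sigma)) %*% matrix(rnorm(n * d), d, n))  # n x d data matrix
ybar <- colMeans(y)

Lambda_n <- solve(solve(Lambda0) + n * solve(Sigma))                # posterior covariance
mu_n     <- Lambda_n %*% (n * solve(Sigma) %*% ybar + solve(Lambda0) %*% mu0)  # posterior mean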

Bayesian Linear regression

  • Often we have covariates \(x_i \in \R^p\) paired with our observations \(y_i\) which we want to use to better describe the distribution of \(y_i\)

  • Let \((x_i, y_i), i=1, \dots, n\) comprise the observed data

  • Let \(x\) be the \(n \times p\) matrix where the \(i^\mathrm{th}\) row is \(x_i\) and let \(y\) be the \(n\)-vector with the \(i^\mathrm{th}\) element equal to \(y_i\)

  • A full model for \((x, y)\) would be a likelihood \(p(x,y \mid \theta)\), along with a prior \(p(\theta)\). Let \(\theta = (\psi, \phi)\), where \(\psi\) governs the distribution \(p(x \mid \psi)\) and \(\phi\) enters into the conditional distribution \(p(y \mid x, \phi, \psi)\)

\[ \begin{align} p(\theta \mid x, y) & \propto p(x, y \mid \theta) p(\theta) \\ & \propto p(x \mid \psi) p(y \mid x, \phi, \psi)p(\phi, \psi) \end{align} \]

  • In general, \(p(y \mid x, \phi, \psi)\) can depend on \(\psi\), and \(\phi \not \indy \psi\) a priori

  • If \(p(y \mid x, \phi, \psi) = p(y \mid x, \phi)\) and \(p(\phi, \psi) = p(\phi)p(\psi)\) we get a nice property

\[ \begin{align} p(\theta \mid x, y) & \propto p(x \mid \psi) p(y \mid x, \phi, \psi)p(\phi, \psi) \\ & \propto p(x \mid \psi) p(y \mid x, \phi)p(\phi)p(\psi) \\ & \propto [p(x \mid \psi) p(\psi)] [p(y \mid x, \phi)p(\phi)] \\ & \propto [p(\psi \mid x)][p(\phi \mid y, x)] \end{align} \]

The usual model

  • For \(i=1, \dots, n\)

\[ y_i \sim \text{Normal}(x_i^T \beta, \sigma^2) \]

  • Then \(\phi = (\beta, \sigma^2)\)

  • Conjugate priors for \(\sigma^2, \beta\)

\[ p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)} \exp\lp -\frac{\nu_0 s_0}{2 \sigma^2} \rp \]

\[ \begin{aligned} p(\beta \mid \sigma^2) \propto (\sigma^2)^{-p / 2} \exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \end{aligned} \]

Dealing with the likelihood

  • The data are conditionally independent across \(i\), implying

\[ p(y \mid x, \beta, \sigma^2) \propto \prod_{i=1}^n (\sigma^2)^{-1/2} \exp\lp-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \rp \]

  • We can of course combine the product

\[ p(y \mid x, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - x_i^T \beta)^2 \rp \]

  • We can use the fact that, jointly, \(y \sim \text{MVN}(x \beta, \sigma^2 I_n)\) to write our likelihood more succinctly

\[ p(y \mid x, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp \]
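
  • As a sanity check, the summed and vectorized forms give the same log-likelihood; a sketch in R (the x, y, \(\beta\), and \(\sigma^2\) below are simulated/arbitrary):

set.seed(5)
n <- 20; p <- 3
x <- matrix(rnorm(n * p), n, p)
beta <- c(1, -0.5, 2); sigma2 <- 1.5
y <- as.vector(x %*% beta + rnorm(n, sd = sqrt(sigma2)))

ll_sum <- sum(dnorm(y, mean = x %*% beta, sd = sqrt(sigma2), log = TRUE))  # product form
r <- y - x %*% beta
ll_mvn <- -n / 2 * log(2 * pi * sigma2) - crossprod(r) / (2 * sigma2)      # MVN form
c(ll_sum, ll_mvn)   # should match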

The posterior

  • We now express the posterior as we have so far, as the product of the likelihood and the priors:

\[ \begin{aligned} p(\beta, \sigma^2 \mid y, x) & \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp \\ & \quad \times (\sigma^2)^{-p / 2} \exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \\ & \quad \times (\sigma^2)^{-(\nu_0/2 + 1)} \exp\lp -\frac{\nu_0 s_0}{2 \sigma^2} \rp \end{aligned} \]

  • We can sort of squint at this expression and glean that we’ll get some sort of normal distribution conditional on \(\sigma^2\) and a marginal distribution over \(\sigma^2\), but we don’t yet know what the parameters of those posteriors will be

  • The first thing we’ll need to do is to combine the expression for \(\beta\) in the prior and in the likelihood into one term

\[ \begin{aligned} & \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp\exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \\ & =\exp\lp-\frac{1}{2\sigma^2}\lp(y - x \beta)^T(y - x \beta) + (\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \rp \end{aligned} \]
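
  • Completing the square in \(\beta\) gives (a sketch; \(\Sigma_n\) and \(\mu_n\) are shorthand introduced here, and \(c\) collects terms that don’t involve \(\beta\)):

\[ \begin{aligned} & (y - x \beta)^T(y - x \beta) + (\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \\ & = \beta^T(x^T x + \Sigma_0^{-1})\beta - 2\beta^T(x^T y + \Sigma_0^{-1}\mu_0) + y^T y + \mu_0^T \Sigma_0^{-1}\mu_0 \\ & = (\beta - \mu_n)^T \Sigma_n^{-1}(\beta - \mu_n) + c, \quad \Sigma_n^{-1} = x^T x + \Sigma_0^{-1}, \quad \mu_n = \Sigma_n(x^T y + \Sigma_0^{-1}\mu_0) \end{aligned} \]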

  • For the rest of the derivation, see my notes here
