Multivariate normal models

Rob Trangucci

Multivariate normal distribution

  • Important to understand how to work with the multivariate normal (MVN) distribution
  • Serves as a useful building block in many models
  • Linear regression, inference on several normal means, time series models, Kalman filters, and Bayesian nonparametric regression can all be formulated as problems in multivariate normal inference
  • We can’t always carry out this inference analytically when covariance parameters are unknown, but we’ll still need to be handy with the various ways we can decompose a multivariate normal distribution

Characterizing the MVN

Let \(y\) be a \(d\)-dimensional random variable, written in sampling notation

\[ y \mid \mu, \Sigma \sim \text{MVN}(\mu, \Sigma) \]

The density of \(y\) is

\[ p(y\mid \mu, \Sigma) \propto \det\lp \Sigma \rp^{-1/2}\exp\lp -\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu) \rp \]

Parameters: \(\Exp{y \mid \mu, \Sigma} = \mu\) and \(\Exp{(y - \mu)(y - \mu)^T \mid \mu, \Sigma} = \Sigma\), where \(\Sigma\) is positive definite

Equivalences between MVNs

  • Let \(I_d\) be the \(d\)-dimensional identity matrix

  • If \(z \sim \text{MVN}(0, I_d)\), then for a \(d \times d\) nonsingular matrix \(A\) and a \(d\)-dimensional vector \(b\), \(y = b + A z\) is distributed as multivariate normal with mean \(b\) and covariance matrix \(AA^T\).

  • Reminder: Positive definite (p.d.) matrices are characterized by the relation \[\alpha^T B \alpha > 0 \quad \forall \alpha \in \R^d, \alpha \neq 0 \]

Cholesky decomposition

  • Let \(\Sigma\) be a \(d \times d\) p.d. matrix

  • Then there exists a lower-triangular matrix \(L\) such that \(\Sigma = LL^T\)

  • In R code L <- t(chol(Sigma))

  • Efficient computation of \[\Sigma^{-1} b = (L L^T)^{-1} b = (L^{-1})^T L^{-1} b, \quad b \in \R^d\]:

    • Compute \(L\), L <- t(chol(Sigma))

    • Compute \(L^{-1}b\), inv_Lb <- forwardsolve(L, b)

    • Compute \((L^{-1})^T L^{-1} b\), inv_Sigma_b <- backsolve(t(L), inv_Lb)

  • Why is this efficient? The Cholesky decomposition has complexity \(d^3 / 3\), and the forward and back solves have complexity \(d^2\)
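
  • A minimal sketch of these three steps in R, checked against a direct solve (the Sigma and b below are arbitrary illustrative values):

set.seed(1)
d <- 4
A <- matrix(rnorm(d * d), d, d)
Sigma <- crossprod(A) + diag(d)          # an arbitrary positive-definite matrix
b <- rnorm(d)

L <- t(chol(Sigma))                      # lower-triangular Cholesky factor
inv_Lb <- forwardsolve(L, b)             # solves L %*% v = b
inv_Sigma_b <- backsolve(t(L), inv_Lb)   # solves t(L) %*% w = v

max(abs(inv_Sigma_b - solve(Sigma, b)))  # should be numerically ~ 0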

Generating draws from a MVN

d <- 3                        # dimension
S <- 1e5                      # number of draws
Sigma <- matrix(
  c(1, 0.9, 0.8,
    0.9, 1, 0.9,
    0.8, 0.9, 1),
  nrow = d, ncol = d)
L <- t(chol(Sigma))           # lower-triangular Cholesky factor
b <- c(1, 2, 3)               # mean vector

z <- matrix(                  # S standard-normal d-vectors, one per column
  rnorm(S * d),
  nrow = d, ncol = S
)

y_mean_zero <- L %*% z        # each column ~ MVN(0, Sigma)
y <- sweep(y_mean_zero,       # add b to each column: y ~ MVN(b, Sigma)
           MARGIN = 1,
           STATS = b,
           FUN = "+")
round(cov(t(y)), 2)           # sample covariance should be close to Sigma
     [,1] [,2] [,3]
[1,]  1.0  0.9  0.8
[2,]  0.9  1.0  0.9
[3,]  0.8  0.9  1.0
round(rowMeans(y), 2)         # sample means should be close to b
[1] 0.99 2.00 3.00

More useful facts

  • \(d\)-dim \(X\) is MVN if and only if \(\alpha^T X\) is univariate normal for all \(\alpha \in \R^d\)

  • For jointly MVN \(X\), \(X_i\) and \(X_j\) are independent if and only if \(\Sigma_{ij} = 0\)

  • Linear transformations of MVNs are MVN:

    • Let \(d\)-dim \(X\) be distributed \(\text{MVN}(\mu, \Sigma)\)

    • Let \(M\) be a \(k \times d\) matrix and let \(b\) be a \(k\)-vector

    • Define \(Y = b + MX\)

    • Then \(Y \sim \text{MVN}(b + M \mu, M\Sigma M^T)\)
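
  • A quick simulation check of this fact (the dimensions, \(M\), \(b\), \(\mu\), and \(\Sigma\) below are arbitrary illustrative choices):

set.seed(2)
d <- 3; k <- 2; S <- 1e5
mu <- c(1, 2, 3)
Sigma <- matrix(c(1, 0.9, 0.8,
                  0.9, 1, 0.9,
                  0.8, 0.9, 1), d, d)
M <- matrix(c(1, 0, -1,
              2, 1,  0), nrow = k, byrow = TRUE)
b <- c(0.5, -0.5)

X <- t(mu + t(chol(Sigma)) %*% matrix(rnorm(S * d), d, S))  # S draws of X, one per row
Y <- t(b + M %*% t(X))                                      # S draws of Y = b + M X

round(colMeans(Y), 2)   # close to b + M %*% mu
round(cov(Y), 2)        # close to M %*% Sigma %*% t(M)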

Partitions of MVN

  • Let \(Y = (Y_1, Y_2)\), with \(Y_1\) \(d_1\)-dimensional and \(Y_2\) \(d_2\)-dimensional, and with the following distribution:

\[ \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \sim \text{MVN}\lp\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \rp \]

  • \(Y_1\) is marginally MVN with mean \(\mu_1\) and covariance \(\Sigma_{11}\)

Can show this by defining \(Y_1 = b + M Y\)

\[ \begin{aligned} b=0 && M = \begin{bmatrix} I_{d_1} & 0_{d_1 \times d_2} \end{bmatrix} \end{aligned} \]

Dist of \(Y_1 \mid Y_2\)

  • Let \(Y_1\) and \(Y_2\) be jointly MVN as in the last slide

  • Let \(X_1 = Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2\)

  • \(X_1 \indy Y_2\) (How can we show this? \(\text{Cov}(X_1, Y_2) = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0\), and for jointly MVN variables zero covariance implies independence)

  • \(\text{Cov}(X_1) = \text{Cov}(Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2, Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2)\)

\[ \begin{aligned} & \text{Cov}(Y_1) - \text{Cov}(\Sigma_{12}\Sigma_{22}^{-1} Y_2,Y_1) - \text{Cov}(Y_1,\Sigma_{12}\Sigma_{22}^{-1} Y_2) + \text{Cov}(\Sigma_{12}\Sigma_{22}^{-1} Y_2)\\ & = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{aligned} \]

  • \(\Exp{X_1} = \mu_1 - \Sigma_{12}\Sigma_{22}^{-1} \mu_2\)

  • Now write \(Y_1 = X_1 + \Sigma_{12}\Sigma_{22}^{-1} Y_2\)

  • Condition \(Y_1\) on \(Y_2 = y\)

  • Note that \(X_1\) doesn’t depend on \(Y_2\) due to \(\indy\), so its distribution is unchanged by conditioning

  • Then \(Y_1\) is just \(X_1\) shifted by \(+\Sigma_{12}\Sigma_{22}^{-1} y\)

\[ Y_1 \mid Y_2 = y \sim \text{MVN}(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2), \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) \]
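
  • A small R sketch of this formula, using the illustrative \(\mu\) and \(\Sigma\) from the earlier example, with \(Y_1\) the first two components and \(Y_2\) the third:

mu <- c(1, 2, 3)
Sigma <- matrix(c(1, 0.9, 0.8,
                  0.9, 1, 0.9,
                  0.8, 0.9, 1), 3, 3)
i1 <- 1:2                                  # indices of Y1
i2 <- 3                                    # indices of Y2
y2 <- 2.5                                  # observed value of Y2 (illustrative)

S12 <- Sigma[i1, i2, drop = FALSE]
S21 <- Sigma[i2, i1, drop = FALSE]
S22 <- Sigma[i2, i2, drop = FALSE]

cond_mean <- mu[i1] + S12 %*% solve(S22, y2 - mu[i2])   # mu_1 + S12 S22^{-1} (y - mu_2)
cond_cov  <- Sigma[i1, i1] - S12 %*% solve(S22, S21)    # S11 - S12 S22^{-1} S21
cond_mean
cond_cov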

Back to Bayes-ics

  • Suppose we have \(y_1 \mid \mu \sim \text{MVN}(\mu, \Sigma)\) with \(\Sigma\) known

  • Furthermore, let \(\mu \sim \text{MVN}(\mu_0, \Lambda_0)\)

  • We can write \(y_1 = \mu + z\) where \(z \sim \text{MVN}(0, \Sigma)\), \(z \indy \mu\) (sort of the converse of what we just proved in the last slide)

  • Then using the fact that \(\text{Cov}(y_1,\mu) = \Lambda_0\), \(\Exp{y_1} = \mu_0\) and \(\text{Cov}(y_1) = \Sigma + \Lambda_0\), we can write

\[ \begin{bmatrix} y_1 \\ \mu \end{bmatrix} \sim \text{MVN}\lp \begin{bmatrix} \mu_0 \\ \mu_0 \end{bmatrix}, \begin{bmatrix} \Sigma + \Lambda_0 & \Lambda_0 \\ \Lambda_0 & \Lambda_0 \end{bmatrix} \rp \]

  • This allows us to use our facts about conditional distributions to get the posterior for \(\mu \mid y_1\):

\[ \begin{aligned} \mu \mid y_1 \sim \text{MVN}(& \mu_0 + \Lambda_0(\Sigma + \Lambda_0)^{-1}(y_1 - \mu_0) ,\\ &\Lambda_0 - \Lambda_0(\Sigma + \Lambda_0)^{-1}\Lambda_0 ) \end{aligned} \]

  • To get this to look more like the book’s results, we can use a matrix identity:

\[ (A + B)^{-1} = A^{-1} - A^{-1}(B^{-1} + A^{-1})^{-1}A^{-1} \]

  • Using this result we get a different form of the covariance matrix:

\[ \Lambda_0 - \Lambda_0(\Sigma + \Lambda_0)^{-1}\Lambda_0 = (\Lambda_0^{-1} + \Sigma^{-1})^{-1} \]

  • We can also rewrite the factor \(\Lambda_0(\Sigma + \Lambda_0)^{-1}\) in the conditional mean by multiplying the identity above on the right by \(\Lambda_0^{-1}\):

\[ \Lambda_0(\Sigma + \Lambda_0)^{-1} = I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1} \]

  • Which yields a conditional mean (writing \(y\) for \(y_1\))

\[ \begin{aligned} & y - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}(y - \mu_0) \\ & = (I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1})y + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \end{aligned} \]

  • We note the following

\[ \begin{aligned} I & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}(\Sigma^{-1} + \Lambda_0^{-1})\\ I & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\\ I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1} & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} \end{aligned} \]

  • Finally yielding

\[ \begin{aligned} & (I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1})y + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \\ & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} y + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \\ & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\lp\Sigma^{-1} y + \Lambda_0^{-1}\mu_0 \rp \end{aligned} \]
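
  • These identities are easy to sanity-check numerically; a sketch with arbitrary positive-definite \(\Sigma\) and \(\Lambda_0\):

set.seed(3)
d <- 3
A1 <- matrix(rnorm(d * d), d, d); Sigma   <- crossprod(A1) + diag(d)
A2 <- matrix(rnorm(d * d), d, d); Lambda0 <- crossprod(A2) + diag(d)
mu0 <- c(1, 0, -1)
y1  <- c(2, 1, 0)

# Form 1: straight from the conditional-MVN result
m1 <- mu0 + Lambda0 %*% solve(Sigma + Lambda0, y1 - mu0)
V1 <- Lambda0 - Lambda0 %*% solve(Sigma + Lambda0, Lambda0)

# Form 2: precision-weighted form derived above
Vn <- solve(solve(Lambda0) + solve(Sigma))
m2 <- Vn %*% (solve(Sigma) %*% y1 + solve(Lambda0) %*% mu0)

max(abs(m1 - m2))   # ~ 0
max(abs(V1 - Vn))   # ~ 0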

With \(n\) observations

  • For \(n\) i.i.d. observations, we can use the fact that \(\bar{y} \mid \mu \sim \text{MVN}(\mu, n^{-1}\Sigma)\) together with the previous results

\[ \begin{aligned} \mu \mid \bar{y} \sim \text{MVN}(&(\Lambda_0^{-1} + n\Sigma^{-1})^{-1}\lp n\Sigma^{-1} \bar{y} + \Lambda_0^{-1}\mu_0 \rp,\\ & (\Lambda_0^{-1} + n\Sigma^{-1})^{-1}) \end{aligned} \]

  • This parallels the univariate normal results (recall):

\[ \mu \mid \bar{y} \sim \text{Normal}\lp\lp \frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp\lp\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp^{-1}, \lp\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp^{-1}\rp \]
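
  • A small R sketch of the \(n\)-observation MVN update above (the \(\Sigma\), \(\Lambda_0\), \(\mu_0\), and simulated data below are illustrative choices):

set.seed(4)
d <- 3; n <- 50
mu_true <- c(1, 2, 3)
Sigma   <- matrix(c(1, 0.9, 0.8,
                    0.9, 1, 0.9,
                    0.8, 0.9, 1), d, d)   # known data covariance
Lambda0 <- diag(d)                        # prior covariance (assumed)
mu0     <- rep(0, d)                      # prior mean (assumed)

y    <- t(mu_true + t(chol(Sigma)) %*% matrix(rnorm(n * d), d, n))  # n x d data matrix
ybar <- colMeans(y)

Lambda_n <- solve(solve(Lambda0) + n * solve(Sigma))                # posterior covariance
mu_n     <- Lambda_n %*% (n * solve(Sigma) %*% ybar + solve(Lambda0) %*% mu0)  # posterior mean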

Bayesian Linear regression

  • Often we have covariates \(x_i \in \R^p\) paired with our observations \(y_i\) which we want to use to better describe the distribution of \(y_i\)

  • Let \((x_i, y_i), i=1, \dots, n\) comprise the observed data

  • Let \(x\) be the \(n \times p\) matrix where the \(i^\mathrm{th}\) row is \(x_i\) and let \(y\) be the \(n\)-vector with the \(i^\mathrm{th}\) element equal to \(y_i\)

  • A full model for \((x, y)\) would be a likelihood \(p(x,y \mid \theta)\), along with a prior \(p(\theta)\). Let \(\theta = (\psi, \phi)\), where \(\psi\) governs the distribution \(p(x \mid \psi)\) and \(\phi\) enters into the conditional distribution \(p(y \mid x, \phi, \psi)\)

\[ \begin{align} p(\theta \mid x, y) & \propto p(x, y \mid \theta) p(\theta) \\ & \propto p(x \mid \psi) p(y \mid x, \phi, \psi)p(\phi, \psi) \end{align} \]

  • In general, \(p(y \mid x, \phi, \psi)\) can depend on \(\psi\), and \(\phi \not \indy \psi\) a priori

  • If \(p(y \mid x, \phi, \psi) = p(y \mid x, \phi)\) and \(p(\phi, \psi) = p(\phi)p(\psi)\) we get a nice property

\[ \begin{align} p(\theta \mid x, y) & \propto p(x \mid \psi) p(y \mid x, \phi, \psi)p(\phi, \psi) \\ & \propto p(x \mid \psi) p(y \mid x, \phi)p(\phi)p(\psi) \\ & \propto [p(x \mid \psi) p(\psi)] [p(y \mid x, \phi)p(\phi)] \\ & \propto [p(\psi \mid x)][p(\phi \mid y, x)] \end{align} \]

The usual model

  • For \(i=1, \dots, n\)

\[ y_i \sim \text{Normal}(x_i^T \beta, \sigma^2) \]

  • Then \(\phi = (\beta, \sigma^2)\)

  • Conjugate priors for \(\sigma^2, \beta\)

\[ p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)} \exp\lp -\frac{\nu_0 s_0}{2 \sigma^2} \rp \]

\[ \begin{aligned} p(\beta \mid \sigma^2) \propto (\sigma^2)^{-p / 2} \exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \end{aligned} \]

Dealing with the likelihood

  • The data are conditionally independent across \(i\), implying

\[ p(y \mid x, \beta, \sigma^2) \propto \prod_{i=1}^n (\sigma^2)^{-1/2} \exp\lp-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \rp \]

  • We can of course combine the product

\[ p(y \mid x, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - x_i^T \beta)^2 \rp \]

  • We can use the fact that, jointly, \(y \sim \text{MVN}(x \beta, \sigma^2 I_n)\) to write our likelihood more succinctly

\[ p(y \mid x, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp \]
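
  • As a sanity check, the summed and vectorized forms give the same log-likelihood; a sketch in R (the x, y, \(\beta\), and \(\sigma^2\) below are simulated/arbitrary):

set.seed(5)
n <- 20; p <- 3
x <- matrix(rnorm(n * p), n, p)
beta <- c(1, -0.5, 2); sigma2 <- 1.5
y <- as.vector(x %*% beta + rnorm(n, sd = sqrt(sigma2)))

ll_sum <- sum(dnorm(y, mean = x %*% beta, sd = sqrt(sigma2), log = TRUE))  # product form
r <- y - x %*% beta
ll_mvn <- -n / 2 * log(2 * pi * sigma2) - crossprod(r) / (2 * sigma2)      # MVN form
c(ll_sum, ll_mvn)   # should match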

The posterior

  • We now express the posterior as we have so far, as the product of the likelihood and the priors:

\[ \begin{aligned} p(\beta, \sigma^2 \mid y, x) & \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp \\ & \quad \times (\sigma^2)^{-p / 2} \exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \\ & \quad \times (\sigma^2)^{-(\nu_0/2 + 1)} \exp\lp -\frac{\nu_0 s_0}{2 \sigma^2} \rp \end{aligned} \]

  • We can sort of squint at this expression and glean that we’ll get some sort of normal distribution conditional on \(\sigma^2\) and a marginal distribution over \(\sigma^2\), but we don’t yet know what the parameters of those posteriors will be

  • The first thing we’ll need to do is to combine the expression for \(\beta\) in the prior and in the likelihood into one term

\[ \begin{aligned} & \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp\exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \\ & =\exp\lp-\frac{1}{2\sigma^2}\lp(y - x \beta)^T(y - x \beta) + (\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \rp \end{aligned} \]
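
  • Completing the square in \(\beta\) gives (a sketch; \(\Sigma_n\) and \(\mu_n\) are shorthand introduced here, and \(c\) collects terms that don’t involve \(\beta\)):

\[ \begin{aligned} & (y - x \beta)^T(y - x \beta) + (\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \\ & = \beta^T(x^T x + \Sigma_0^{-1})\beta - 2\beta^T(x^T y + \Sigma_0^{-1}\mu_0) + y^T y + \mu_0^T \Sigma_0^{-1}\mu_0 \\ & = (\beta - \mu_n)^T \Sigma_n^{-1}(\beta - \mu_n) + c, \quad \Sigma_n^{-1} = x^T x + \Sigma_0^{-1}, \quad \mu_n = \Sigma_n(x^T y + \Sigma_0^{-1}\mu_0) \end{aligned} \]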

  • For the rest of the derivation, see my notes here
