Let \(y\) be a \(d\)-dimensional random variable, written in sampling notation
\[ y \mid \mu, \Sigma \sim \text{MVN}(\mu, \Sigma) \]
The density of \(y\) is
\[ p(y\mid \mu, \Sigma) \propto \det\lp \Sigma \rp^{-1/2}\exp\lp -\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu) \rp \]
Parameters: \(\Exp{y \mid \mu, \Sigma} = \mu\) and \(\Exp{(y - \mu)(y - \mu)^T \mid \mu, \Sigma} = \Sigma\), where \(\Sigma\) is positive definite
Let \(I_d\) be the \(d\)-dimensional identity matrix
If \(z \sim \text{MVN}(0, I_d)\), then for a \(d \times d\) nonsingular matrix \(A\) and a \(d\)-dimensional vector \(b\), \(y = b + A z\) is distributed as multivariate normal with mean \(b\) and covariance matrix \(AA^T\).
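This is the standard recipe for simulating MVN draws. A minimal R sketch, where the particular \(A\) and \(b\) are arbitrary illustrative choices:

d <- 3
A <- matrix(c(2,   0,   0,
              1,   1,   0,
              0.5, 0.3, 1), d, d, byrow = TRUE)  # an illustrative nonsingular A
b <- c(1, -1, 0)                                 # an illustrative mean vector
z <- rnorm(d)                                    # z ~ MVN(0, I_d)
y <- b + drop(A %*% z)                           # y ~ MVN(b, A %*% t(A))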
Reminder: Positive definite (p.d.) matrices are characterized by the relation \[\alpha^T B \alpha > 0 \quad \forall \alpha \in \R^d, \alpha \neq 0 \]
Let \(\Sigma\) be a \(d \times d\) p.d. matrix
Then there exists a lower-triangular matrix \(L\) such that \(\Sigma = LL^T\)
In R: L <- t(chol(Sigma)) (chol() returns the upper-triangular factor, hence the transpose)
Efficient computation of \[\Sigma^{-1} b = (L L^T)^{-1} b = (L^{-1})^T L^{-1} b, \quad b \in \R^d\]:
Compute \(L\): L <- t(chol(Sigma))
Compute \(L^{-1}b\): inv_Lb <- forwardsolve(L, b)
Compute \((L^{-1})^T L^{-1} b\): inv_Sigma_b <- backsolve(t(L), inv_Lb) (base R's function is backsolve(), which expects an upper-triangular matrix, hence the t(L))
Why is this efficient? The Cholesky decomposition costs on the order of \(d^3 / 3\) floating-point operations, and the forward and backward solves each cost on the order of \(d^2\)
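Putting the three steps together in R, with an illustrative \(\Sigma\) and \(b\), and checking against a direct solve:

set.seed(1)
d <- 4
M <- matrix(rnorm(d * d), d, d)
Sigma <- crossprod(M) + diag(d)          # an illustrative positive definite matrix
b <- rnorm(d)
L <- t(chol(Sigma))                      # Sigma = L L^T, L lower triangular
inv_Lb <- forwardsolve(L, b)             # solves L u = b
inv_Sigma_b <- backsolve(t(L), inv_Lb)   # solves L^T v = u
all.equal(inv_Sigma_b, solve(Sigma, b))  # TRUE, up to numerical error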
\(d\)-dim \(X\) is MVN if \(\alpha^T X\) is univariate normal for all \(\alpha \in \R^d\)
For MVN \(X\), \(X_i\) and \(X_j\) are independent if and only if \(\Sigma_{ij} = 0\)
Linear transformations of MVNs are MVN:
Let \(d\)-dim \(X\) be distributed \(\text{MVN}(\mu, \Sigma)\)
Let \(M\) be a \(k \times d\) matrix and let \(b\) be a \(k\)-vector
Define \(Y = b + MX\)
Then \(Y \sim \text{MVN}(b + M \mu, M\Sigma M^T)\)
Partitioning an MVN vector into two blocks:
\[ \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} \sim \text{MVN}\lp\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \rp \]
We can show that the marginal \(Y_1 \sim \text{MVN}(\mu_1, \Sigma_{11})\) by taking \(Y_1 = b + M Y\) with
\[ \begin{aligned} b=0 && M = \begin{bmatrix} I_{d_1} & 0_{d_1 \times d_2} \end{bmatrix} \end{aligned} \]
Let \(Y_1\) and \(Y_2\) be jointly MVN as in the last slide
Let \(X_1 = Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2\)
\(X_1 \indy Y_2\) (How can we show this?)
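One way: compute the cross-covariance and recall that, for jointly MVN vectors, zero covariance implies independence:
\[ \text{Cov}(X_1, Y_2) = \text{Cov}(Y_1, Y_2) - \Sigma_{12}\Sigma_{22}^{-1}\text{Cov}(Y_2, Y_2) = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0 \]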
\(\text{Cov}(X_1) = \text{Cov}(Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2, Y_1 - \Sigma_{12}\Sigma_{22}^{-1} Y_2)\)
\[ \begin{aligned} & \text{Cov}(Y_1) - \text{Cov}(\Sigma_{12}\Sigma_{22}^{-1} Y_2,Y_1) - \text{Cov}(Y_1,\Sigma_{12}\Sigma_{22}^{-1} Y_2) + \text{Cov}(\Sigma_{12}\Sigma_{22}^{-1} Y_2)\\ & = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21}\\ & = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{aligned} \]
\(\Exp{X_1} = \mu_1 - \Sigma_{12}\Sigma_{22}^{-1} \mu_2\)
Now write \(Y_1 = X_1 + \Sigma_{12}\Sigma_{22}^{-1} Y_2\)
Condition \(Y_1\) on \(Y_2 = y\)
Note that \(X_1\) doesn’t depend on \(Y_2\) due to \(\indy\), so its distribution is unchanged by conditioning
Then \(Y_1\) is just \(X_1\) shifted by \(+\Sigma_{12}\Sigma_{22}^{-1} y\)
\[ Y_1 \mid Y_2 = y \sim \text{MVN}(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y - \mu_2), \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) \]
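A small R sketch of this conditioning formula; the function and argument names here are my own, with i1 and i2 indexing the two blocks:

cond_mvn <- function(mu, Sigma, i1, i2, y2) {
  W <- Sigma[i1, i2, drop = FALSE] %*% solve(Sigma[i2, i2, drop = FALSE])  # Sigma_12 Sigma_22^{-1}
  list(mean = mu[i1] + drop(W %*% (y2 - mu[i2])),
       cov  = Sigma[i1, i1, drop = FALSE] - W %*% Sigma[i2, i1, drop = FALSE])
}
# e.g. cond_mvn(mu, Sigma, i1 = 1:2, i2 = 3:4, y2 = c(0.5, -0.2)) for a 4-dim Y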
Suppose we have \(y_1 \mid \mu \sim \text{MVN}(\mu, \Sigma)\) with \(\Sigma\) known
Furthermore, let \(\mu \sim \text{MVN}(\mu_0, \Lambda_0)\)
We can write \(y_1 = \mu + z\) where \(z \sim \text{MVN}(0, \Sigma)\), \(z \indy \mu\) (sort of the converse of what we just proved in the last slide)
Then, using the facts that \(\text{Cov}(y_1, \mu) = \Lambda_0\), \(\Exp{y_1} = \mu_0\), and \(\text{Cov}(y_1) = \Sigma + \Lambda_0\), we can write
\[ \begin{bmatrix} y_1 \\ \mu \end{bmatrix} \sim \text{MVN}\lp \begin{bmatrix} \mu_0 \\ \mu_0 \end{bmatrix}, \begin{bmatrix} \Sigma + \Lambda_0 & \Lambda_0 \\ \Lambda_0 & \Lambda_0 \end{bmatrix} \rp \]
Applying the conditional MVN formula (with \(\mu\) in the role of \(Y_1\) and \(y_1\) in the role of \(Y_2\)):
\[ \begin{aligned} \mu \mid y_1 \sim \text{MVN}(& \mu_0 + \Lambda_0(\Sigma + \Lambda_0)^{-1}(y_1 - \mu_0) ,\\ &\Lambda_0 - \Lambda_0(\Sigma + \Lambda_0)^{-1}\Lambda_0 ) \end{aligned} \]
To simplify these posterior parameters, we use the matrix inversion identity
\[ (A + B)^{-1} = A^{-1} - A^{-1}(B^{-1} + A^{-1})^{-1}A^{-1} \]
Applying it with \(A = \Lambda_0^{-1}\) and \(B = \Sigma^{-1}\) simplifies the posterior covariance:
\[ \Lambda_0 - \Lambda_0(\Sigma + \Lambda_0)^{-1}\Lambda_0 = (\Lambda_0^{-1} + \Sigma^{-1})^{-1} \]
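A quick numeric sanity check of this simplification in R, with illustrative positive definite matrices:

set.seed(2)
S  <- crossprod(matrix(rnorm(9), 3)) + diag(3)   # Sigma
L0 <- crossprod(matrix(rnorm(9), 3)) + diag(3)   # Lambda_0
lhs <- L0 - L0 %*% solve(S + L0) %*% L0
rhs <- solve(solve(L0) + solve(S))
all.equal(lhs, rhs)                              # TRUE, up to numerical error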
Rearranging the previous line and multiplying on the right by \(\Lambda_0^{-1}\) gives
\[ \Lambda_0(\Sigma + \Lambda_0)^{-1} = I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1} \]
Substituting this into the posterior mean \(\mu_0 + \Lambda_0(\Sigma + \Lambda_0)^{-1}(y_1 - \mu_0)\) and simplifying:
\[ \begin{aligned} & y_1 - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}(y_1 - \mu_0) \\ & = (I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1})y_1 + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \end{aligned} \]
\[ \begin{aligned} I & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}(\Sigma^{-1} + \Lambda_0^{-1})\\ I & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\\ I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1} & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} \end{aligned} \]
\[ \begin{aligned} & (I - (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1})y_1 + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \\ & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Sigma^{-1} y_1 + (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\Lambda_0^{-1}\mu_0 \\ & = (\Lambda_0^{-1} + \Sigma^{-1})^{-1}\lp\Sigma^{-1} y_1 + \Lambda_0^{-1}\mu_0 \rp \end{aligned} \]
That is, the posterior mean is a precision-weighted combination of the observation \(y_1\) and the prior mean \(\mu_0\).
For \(n\) i.i.d. observations, the sample mean satisfies \(\bar{y} \mid \mu \sim \text{MVN}(\mu, \Sigma / n)\), so the same derivation with \(\Sigma^{-1}\) replaced by \(n\Sigma^{-1}\) gives
\[ \begin{aligned} \mu \mid \bar{y} \sim \text{MVN}(&(\Lambda_0^{-1} + n\Sigma^{-1})^{-1}\lp n\Sigma^{-1} \bar{y} + \Lambda_0^{-1}\mu_0 \rp,\\ & (\Lambda_0^{-1} + n\Sigma^{-1})^{-1}) \end{aligned} \]
In one dimension, with \(\Sigma = \sigma^2\) and \(\Lambda_0 = \tau_0^2\), this reduces to
\[ \mu \mid \bar{y} \sim \text{Normal}\lp\lp \frac{n\bar{y}}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp\lp\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp^{-1}, \lp\frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp^{-1}\rp \]
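A sketch of the multivariate update in R; the function and argument names are my own:

posterior_mu <- function(ybar, n, Sigma, mu0, Lambda0) {
  Sigma_inv   <- solve(Sigma)
  Lambda0_inv <- solve(Lambda0)
  V <- solve(Lambda0_inv + n * Sigma_inv)                    # posterior covariance
  m <- V %*% (n * Sigma_inv %*% ybar + Lambda0_inv %*% mu0)  # posterior mean
  list(mean = drop(m), cov = V)
}

In practice the solve() calls would be replaced by the Cholesky-based triangular solves from earlier.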
Often we have covariates \(x_i \in \R^p\) paired with our observations \(y_i\) which we want to use to better describe the distribution of \(y_i\)
Let \((x_i, y_i), i=1, \dots, n\) comprise the observed data
Let \(x\) be the \(n \times p\) matrix where the \(i^\mathrm{th}\) row is \(x_i\) and let \(y\) be the \(n\)-vector with the \(i^\mathrm{th}\) element equal to \(y_i\)
A full model for \((x, y)\) would be a likelihood \(p(x,y \mid \theta)\) along with a prior \(p(\theta)\). Let \(\theta = (\psi, \phi)\), where \(\psi\) governs the distribution \(p(x \mid \psi)\) and \(\phi\) enters into the conditional distribution \(p(y \mid x, \phi, \psi)\)
\[ \begin{align} p(\theta \mid x, y) & \propto p(x, y \mid \theta) p(\theta) \\ & \propto p(x \mid \psi) p(y \mid x, \phi, \psi)p(\phi, \psi) \end{align} \]
In general, \(p(y \mid x, \phi, \psi)\) can depend on \(\psi\), and \(\phi \not\indy \psi\) a priori
If \(p(y \mid x, \phi, \psi) = p(y \mid x, \phi)\) and \(p(\phi, \psi) = p(\phi)p(\psi)\) we get a nice property
\[ \begin{align} p(\theta \mid x, y) & \propto p(x \mid \psi) p(y \mid x, \phi, \psi)p(\phi, \psi) \\ & \propto p(x \mid \psi) p(y \mid x, \phi)p(\phi)p(\psi) \\ & \propto [p(x \mid \psi) p(\psi)] [p(y \mid x, \phi)p(\phi)] \\ & \propto [p(\psi \mid x)][p(\phi \mid y, x)] \end{align} \]
The posterior factorizes: \(\psi \indy \phi\) a posteriori, so inference for \(\phi\) can proceed from \(p(y \mid x, \phi)p(\phi)\) alone, ignoring the model for \(x\)
For linear regression we take, independently for \(i = 1, \dots, n\),
\[ y_i \sim \text{Normal}(x_i^T \beta, \sigma^2) \]
Then \(\phi = (\beta, \sigma^2)\)
Conjugate priors for \(\sigma^2, \beta\)
\[ p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)} \exp\lp -\frac{\nu_0 s_0}{2 \sigma^2} \rp \]
\[ \begin{aligned} p(\beta \mid \sigma^2) \propto (\sigma^2)^{-p / 2} \exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \end{aligned} \]
\[ p(y \mid x, \beta, \sigma^2) \propto \prod_{i=1}^n (\sigma^2)^{-1/2} \exp\lp-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \rp \]
\[ p(y \mid x, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - x_i^T \beta)^2 \rp \]
\[ p(y \mid x, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp \]
\[ \begin{aligned} p(\beta, \sigma^2 \mid y, x) & \propto (\sigma^2)^{-n/2} \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp \\ & \quad \times (\sigma^2)^{-p / 2} \exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \\ & \quad \times (\sigma^2)^{-(\nu_0/2 + 1)} \exp\lp -\frac{\nu_0 s_0}{2 \sigma^2} \rp \end{aligned} \]
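As a concrete translation, the unnormalized log posterior above can be evaluated directly in R; this is a sketch, and all names here are illustrative:

log_post <- function(beta, sigma2, y, X, mu0, Sigma0, nu0, s0) {
  n <- length(y)
  p <- length(beta)
  resid <- y - X %*% beta
  quad  <- drop(t(beta - mu0) %*% solve(Sigma0) %*% (beta - mu0))
  -(n / 2) * log(sigma2) - sum(resid^2) / (2 * sigma2) -     # likelihood
    (p / 2) * log(sigma2) - quad / (2 * sigma2) -            # p(beta | sigma2)
    (nu0 / 2 + 1) * log(sigma2) - nu0 * s0 / (2 * sigma2)    # p(sigma2)
}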
Squinting at this expression, we can see that we'll get some sort of normal distribution for \(\beta\) conditional on \(\sigma^2\), and a marginal distribution over \(\sigma^2\); but the parameters of those posteriors are not yet apparent
The first thing we'll need to do is combine the two quadratic forms in \(\beta\), one from the likelihood and one from the prior, into a single term
\[ \begin{aligned} & \exp\lp-\frac{1}{2\sigma^2}(y - x \beta)^T(y - x \beta) \rp\exp\lp-\frac{1}{2\sigma^2}(\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \\ & =\exp\lp-\frac{1}{2\sigma^2}\lp(y - x \beta)^T(y - x \beta) + (\beta - \mu_0)^T \Sigma_0^{-1}(\beta - \mu_0) \rp \rp \end{aligned} \]