Single parameter models

Rob Trangucci

Attribution

Clarifications on last lecture

  • Prior predictive distribution for binomial likelihood and beta prior

    • Called the beta-binomial: we’ll derive it on the iPad…

Walking through setting the prior

# Objective: squared distance between the beta quantiles and the targets
# (1st percentile at 0.02, 99th percentile at 0.35); parameters are
# optimized on the log scale so that alpha and beta stay positive
beta_fun <- function(pars) {
  alpha <- exp(pars[1])
  beta <- exp(pars[2])
  diff_1 <- qbeta(0.01, shape1 = alpha, shape2 = beta) - 0.02
  diff_2 <- qbeta(0.99, shape1 = alpha, shape2 = beta) - 0.35
  return(diff_1^2 + diff_2^2)
}

# Minimize the objective, starting from (alpha, beta) = (8, 60)
res <- optim(c(log(8), log(60)), beta_fun)
alpha_use <- exp(res$par[1])
beta_use <- exp(res$par[2])
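
As a quick check, the fitted parameters should reproduce the target quantiles (a one-line verification using the values computed above):

# Should be approximately 0.02 and 0.35
qbeta(c(0.01, 0.99), shape1 = alpha_use, shape2 = beta_use)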

Exchangeability

  • Exchangeability is a critical concept to understand when discussing Bayesian models

  • Formally, a sequence of random variables \(Y_1, Y_2, Y_3, \dots, Y_n, \dots\) is exchangeable if its joint distribution does not change under any permutation \(\sigma\) of the indices, i.e., \(Y_{\sigma(1)}, Y_{\sigma(2)}, Y_{\sigma(3)}, \dots, Y_{\sigma(n)}, \dots\) has the same joint distribution

  • Given a parametric density for \(Y_i\) that depends on \(\theta\), the assumption that the \(Y_i\) are conditionally independent given \(\theta\), and a prior for \(\theta\), the marginal joint density of the \(Y_i\) is exchangeable:

  • \[ p_{\mathbf{Y}}(y_1, y_2, \dots, y_n) = \int_{\Omega_\Theta} \prod_{i=1}^n p(y_i \mid \theta) p(\theta) d\theta \]

  • The idea goes in the opposite direction as well (de Finetti’s representation theorem): given an infinite sequence of exchangeable random variables (meaning that, for every \(n\), the finite subsequences are exchangeable), there must exist a random variable \(\Theta\) such that the following holds \[ p_{\mathbf{Y}}(y_1, y_2, \dots, y_n) = \int_{\Omega_\Theta} \prod_{i=1}^n p(y_i \mid \theta) p(\theta) d\theta \]

  • Further, \(\frac{1}{n}\sum_{i=1}^n Y_i \to \Theta\) as \(n\to\infty\)

Exchangeable sequences are positively correlated

Let \(Y_i, i = 1, 2, \dots\) be exchangeable

\[ \begin{aligned} \text{Cov}(Y_1, Y_2) & = \ExpA{\text{Cov}(Y_1, Y_2 \mid \theta)}{\Theta} + \text{Cov}_{\Theta}(\Exp{Y_1 \mid \theta}, \Exp{Y_2 \mid \theta}) \\ & = \text{Cov}_{\Theta}(\Exp{Y_1 \mid \theta}, \Exp{Y_2 \mid \theta}) \\ & = \VarA{\Exp{Y_1 \mid \theta}}{\Theta} \\ & \geq 0 \end{aligned} \]

The first term vanishes because the \(Y_i\) are conditionally independent given \(\theta\); by exchangeability \(\Exp{Y_1 \mid \theta} = \Exp{Y_2 \mid \theta}\), so the covariance of the conditional means is a variance, which is nonnegative.
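
A quick simulation illustrates this (a sketch; the Beta(2, 5) prior and Bernoulli likelihood are illustrative choices, not from the lecture):

set.seed(1)
n_sims <- 1e5
# Draw theta once per replication, then two conditionally iid outcomes;
# marginally, (y1, y2) is an exchangeable pair
theta <- rbeta(n_sims, shape1 = 2, shape2 = 5)
y1 <- rbinom(n_sims, size = 1, prob = theta)
y2 <- rbinom(n_sims, size = 1, prob = theta)
cov(y1, y2)  # positive, approximately equal to...
var(theta)   # ...Var(E[Y | theta]) = Var(theta)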

Normal models

  • Normal likelihoods are special for us statisticians because of the CLT (see § 5.5.3 in Casella and Berger (2024))

  • It is computationally convenient

    • The normal CDF (pnorm in R) is implemented in many math libraries
  • It has nice analytical properties

    • Easy to bound normal densities and CDFs

    • Jointly normal RVs are conditionally normal

    • Convolutions of independent normals are normal

Normally distributed outcomes

  • Normal outcome \(y\) is real-valued

  • Mean \(\theta\) and variance \(\sigma^2\)

  • Density \[ p(\textcolor{red}{y} \mid \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\lp-\frac{1}{2\sigma^2}(\textcolor{red}{y} - \theta)^2 \rp \]

  • \(\int_{-\infty}^{\infty} p(\textcolor{red}{y} \mid \theta) d\textcolor{red}{y} = 1\)

Normal likelihood

  • Likelihood for \(\theta\) (assuming \(\sigma^2\) is known!): \[ p(y \mid \textcolor{red}{\theta}) \propto \exp\lp-\frac{1}{2\sigma^2}(y - \textcolor{red}{\theta})^2 \rp \]

  • In the binomial example, we had \(p(y \mid \theta) \propto \theta^{y} (1 - \theta)^{n - y}\), so we could read off the conjugate prior (a beta distribution)

  • Rearranging things in the normal likelihood will help us see the form of the conjugate prior

Normal likelihood in exponential family form

\[ \scriptsize \begin{aligned} p(y \mid \theta) & \propto \exp\lp-\frac{1}{2\sigma^2}(y - \theta)^2 \rp \\ & = \exp\lp-\frac{y^2}{2\sigma^2} + \frac{y}{\sigma^2} \theta -\frac{\theta^2}{2\sigma^2}\rp \\ & \propto \exp\lp\frac{\theta}{\sigma^2} y -\frac{\theta^2}{2\sigma^2}\rp \end{aligned} \]

Prior Form

  • \[ \begin{aligned} p(\theta \mid \mu_0, \tau_0) & \propto \exp\lp \frac{\mu_0}{\tau_0^2} \theta - \frac{\theta^2}{2\tau_0^2}\rp \\ \end{aligned} \] which is the kernel of a \(\text{Normal}(\mu_0, \tau_0^2)\) density
  • Now we compute \(p(\theta \mid y, \mu_0, \tau_0) \propto p(y \mid \theta) p(\theta \mid \mu_0, \tau_0)\) \[ p(\theta \mid y, \mu_0, \tau_0) \propto \exp\lp\frac{\theta}{\sigma^2} y -\frac{\theta^2}{2\sigma^2}\rp \exp\lp \frac{\mu_0}{\tau_0^2} \theta - \frac{\theta^2}{2\tau_0^2}\rp \]

Working through the algebra

\[ \scriptsize \begin{aligned} & \exp\lp\theta\lp\frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp -\frac{\theta^2}{2}\lp\frac{1}{\sigma^2} + \frac{1}{\tau_0^2} \rp\rp \\ & = \exp\lp\theta\lp\frac{\mu_1}{\tau_1^2}\rp -\frac{\theta^2}{2}\lp\frac{1}{\tau_1^2}\rp\rp \\ & \implies \end{aligned} \] \[ \scriptsize \begin{aligned} \frac{1}{\tau_1^2} & = \frac{1}{\sigma^2} + \frac{1}{\tau_0^2} && \frac{\mu_1}{\tau_1^2} = \frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2} \\ \mu_1 & = \lp \frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp \tau_1^2 && \mu_1 = \lp\frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp / \lp \frac{1}{\sigma^2} + \frac{1}{\tau_0^2}\rp \end{aligned} \]
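
These update equations translate directly into R (a minimal sketch; the function name and example values are mine, not from the lecture):

# Posterior mean and sd for theta after a single observation y,
# with known sigma and prior theta ~ Normal(mu0, tau0^2)
normal_update <- function(y, sigma, mu0, tau0) {
  prec1 <- 1 / sigma^2 + 1 / tau0^2              # posterior precision, 1 / tau_1^2
  mu1 <- (y / sigma^2 + mu0 / tau0^2) / prec1    # precision-weighted average
  list(mu1 = mu1, tau1 = sqrt(1 / prec1))
}

normal_update(y = 1.3, sigma = 1, mu0 = 0, tau0 = 2)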

A tale of two likelihoods

Interpreting the parameters

  • \[ \mu_1 = \mu_0 + (y - \mu_0) \frac{\tau_0^2}{\tau_0^2 + \sigma^2} \] the prior mean, adjusted toward the observation \(y\)
  • vs.  \[ \mu_1 = y - (y - \mu_0) \frac{\sigma^2}{\tau_0^2 + \sigma^2} \] the observation, shrunk back toward the prior mean \(\mu_0\)

Posterior predictive distribution

  • We might want to make predictions for another observation

  • \[ p(\tilde{y} \mid y) = \int_{-\infty}^\infty p(\tilde{y} \mid \textcolor{red}{\theta}) p(\textcolor{red}{\theta} \mid y, \mu_0, \tau_0) d \textcolor{red}{\theta} \]

  • \[ \begin{aligned} & \int_{-\infty}^\infty p(\tilde{y} \mid \textcolor{red}{\theta}) p(\textcolor{red}{\theta} \mid y, \mu_0, \tau_0) d \textcolor{red}{\theta} \\ & = \int_{-\infty}^\infty \frac{1}{\sqrt{2\pi}\sigma}\exp\lp-\frac{1}{2\sigma^2}(\tilde{y} - \theta)^2\rp\frac{1}{\sqrt{2\pi}\tau_1}\exp\lp-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\rp d \textcolor{red}{\theta} \end{aligned} \]

  • We can use the fact that \(\tilde{y} \mid \theta \sim \text{Normal}(\theta, \sigma^2)\), which is equal in distribution to \(\tilde{Y} = \theta + Z\) where \(Z \sim \text{Normal}(0, \sigma^2)\) is independent of \(\theta\).

  • Since \(\theta \mid y \sim \text{Normal}(\mu_1, \tau_1^2)\), marginally \(\tilde{Y} \sim \text{Normal}(\mu_1, \tau_1^2 + \sigma^2)\)
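
A minimal Monte Carlo check of this marginal result (reusing the normal_update sketch from above with illustrative values):

set.seed(2)
post <- normal_update(y = 1.3, sigma = 1, mu0 = 0, tau0 = 2)
# Draw theta from the posterior, then y_tilde given theta
theta_draws <- rnorm(1e5, mean = post$mu1, sd = post$tau1)
y_tilde <- rnorm(1e5, mean = theta_draws, sd = 1)
c(mean(y_tilde), var(y_tilde))  # Monte Carlo estimates
c(post$mu1, post$tau1^2 + 1^2)  # analytic: Normal(mu_1, tau_1^2 + sigma^2)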

Exponential families

  • The normal and binomial distributions are examples of a more general class of densities called exponential families

  • The exponential family distributions can be written in the following form

  • \[ p(y_i \mid \theta) = f(y_i) g(\theta)e^{\phi(\theta)^T u(y_i)} \]

  • With \(n\) iid observations, calling \(t(y) = \sum_{i=1}^n u(y_i)\) the joint density is \[ p(y \mid \theta) = \lp \prod_{i=1}^n f(y_i)\rp g(\theta)^n \exp \lp \phi(\theta)^T t(y)\rp \]

  • We can write the density as \[ p(y\mid \theta) \propto g(\theta)^n \exp \lp \phi(\theta)^T t(y)\rp \]

  • This shows that \(t(y)\) is sufficient for \(\theta\)
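
As a concrete example, the binomial density from earlier can be written in this form:

\[ p(y \mid \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y} = \underbrace{\binom{n}{y}}_{f(y)} \, \underbrace{(1-\theta)^n}_{g(\theta)} \, \exp\lp \underbrace{\log \tfrac{\theta}{1-\theta}}_{\phi(\theta)} \, \underbrace{y}_{u(y)} \rp \]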

Conjugate priors in exponential families

  • When \[ p(y\mid \theta) \propto g(\theta)^n \exp \lp \phi(\theta)^T t(y)\rp \]

  • Specifying \(p(\theta) \propto g(\theta)^\eta \exp\lp \phi(\theta)^T \nu \rp\) yields the posterior:

  • \[ p(\theta \mid y) \propto g(\theta)^{n + \eta} \exp \lp\phi(\theta)^T(t(y) + \nu) \rp \]
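
To connect this back to the binomial (a sketch using the Bernoulli case, with \(g(\theta) = 1 - \theta\) and \(\phi(\theta) = \log\frac{\theta}{1-\theta}\)), the conjugate prior form gives

\[ p(\theta) \propto (1-\theta)^\eta \exp\lp \nu \log\tfrac{\theta}{1-\theta}\rp = \theta^\nu (1-\theta)^{\eta - \nu}, \]

the kernel of a \(\text{Beta}(\nu + 1, \eta - \nu + 1)\) distribution, consistent with the beta prior from last lecture.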

Back to the normal again

  • With \(n\) observations, we know that \(\bar{y}\) is sufficient for \(\mu\) when \(y \sim \text{Normal}(\mu, \sigma^2)\)

  • We know \(\bar{y} \sim \text{Normal}(\mu, \frac{\sigma^2}{n})\), so we can just use our earlier results to get the posterior \(\mu_n\) and \(\tau_n\) parameters

  • \[ \scriptsize \begin{aligned} \frac{1}{\tau_n^2} & = \frac{n}{\sigma^2} + \frac{1}{\tau_0^2} \\ \mu_n & = \lp \bar{y} \frac{n}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\rp / \lp \frac{n}{\sigma^2} + \frac{1}{\tau_0^2}\rp \end{aligned} \]

  • As \(n\to\infty\) we get \(\mu_n \to \bar{y}\) and \(\tau^2_n \to \frac{\sigma^2}{n}\)

  • This shows that the posterior for \(\theta\) converges to a normal distribution with mean \(\bar{y}\) and variance \(\frac{\sigma^2}{n}\)

  • We get the same convergence in distribution if \(\tau_0 \to \infty\)
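
A quick numerical check of these limits (a sketch with illustrative values; the helper function and its name are mine):

# Posterior for theta given n observations with sample mean ybar
normal_update_n <- function(ybar, n, sigma, mu0, tau0) {
  prec_n <- n / sigma^2 + 1 / tau0^2
  mu_n <- (n * ybar / sigma^2 + mu0 / tau0^2) / prec_n
  c(mu_n = mu_n, tau_n2 = 1 / prec_n)
}

# As n grows, mu_n approaches ybar and tau_n^2 approaches sigma^2 / n
for (n in c(10, 100, 1000)) {
  print(c(normal_update_n(ybar = 1.5, n = n, sigma = 1, mu0 = 0, tau0 = 0.5),
          sigma2_over_n = 1 / n))
}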

Models for count data

  • When we have \(y_i \in \{0\} \cup \Z^+\), we’ll need to use something other than a normal

  • One common model is \(y_i \sim \text{Poisson}(\theta)\) \[ p(y_i \mid \theta) = \frac{e^{-\theta} \theta^{y_i}}{y_i!} \]

  • Rewriting this in exponential family form (dropping the \(1/y_i!\) constant): \[ p(y_i \mid \theta) \propto e^{-\theta} e^{\log(\theta)\, y_i} \]

  • Suggests a prior of the form \[ \begin{aligned} p(\theta) & \propto e^{-\theta \eta} e^{\log(\theta) \nu} \\ & \propto \theta^\nu e^{-\theta \eta} \end{aligned} \]

  • This is a Gamma distribution with \(\alpha = \nu + 1\) and \(\beta = \eta\): \(p(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha - 1} e^{-\beta \theta}\)

Multiple observations

  • With \(n\) iid observations the joint density for \(y\) is \[ \begin{aligned} p(y \mid \theta) & \propto e^{-n\theta} e^{\log(\theta) \sum_{i=1}^n y_i} \\ & \propto e^{-n\theta} \theta^{\sum_{i=1}^n y_i} \\ & \propto e^{-n\theta} \theta^{t(y)} \end{aligned} \]

  • With a prior \(p(\theta) \propto \theta^{\alpha - 1} e^{-\beta \theta}\) we get

  • Posterior \[ p(\theta \mid y) \propto e^{-\theta(n + \beta)} \theta^{t(y) + \alpha - 1} \]

  • Or \(\theta \mid y \sim \text{Gamma}(\alpha + t(y), \beta + n)\)
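
In R this conjugate update is a one-liner (a sketch; the function name is mine):

# Gamma posterior for a Poisson rate: prior Gamma(alpha, beta), data y_1, ..., y_n
gamma_update <- function(y, alpha, beta) {
  c(alpha_post = alpha + sum(y), beta_post = beta + length(y))
}

gamma_update(y = c(0, 2, 1, 3), alpha = 2, beta = 1)  # Gamma(8, 5)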

Deriving the prior/posterior predictive distribution

\[ \begin{aligned} & \frac{\beta^\alpha}{y!\Gamma(\alpha)} \int_0^\infty e^{-\theta} \theta^y \theta^{\alpha - 1} e^{-\beta \theta} d\theta \\ & = \frac{\beta^\alpha}{y!\Gamma(\alpha)} \int_0^\infty \theta^{y + \alpha - 1} e^{-\theta(\beta + 1)} d\theta \\ & = \frac{\beta^\alpha}{y!\Gamma(\alpha)}\frac{\Gamma(\alpha + y)}{(\beta + 1)^{\alpha + y}} \\ & = \frac{\Gamma(\alpha + y)}{y!\Gamma(\alpha)}\lp\frac{\beta}{\beta + 1}\rp^\alpha \lp\frac{1}{\beta + 1}\rp^y \end{aligned} \]
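
We can check the final line against R’s built-in negative binomial pmf, which in the size/probability parameterization corresponds to \(\text{size} = \alpha\) and \(\text{prob} = \beta/(\beta+1)\):

alpha <- 2; beta <- 3; y <- 0:5
# Prior predictive pmf as derived above
pmf <- gamma(alpha + y) / (factorial(y) * gamma(alpha)) *
  (beta / (beta + 1))^alpha * (1 / (beta + 1))^y
# Matches dnbinom with size = alpha, prob = beta / (beta + 1)
all.equal(pmf, dnbinom(y, size = alpha, prob = beta / (beta + 1)))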

Negative binomial distribution

  • On the last slide we derived the negative binomial distribution with parameters \(\alpha\) and \(\beta\)

  • How might we get the mean and variance of this distribution?

  • We can get them from the hierarchical representation: \(y \mid \theta \sim \text{Poisson}(\theta)\) and \(\theta \sim \text{Gamma}(\alpha, \beta)\).

  • \(\Exp{y} = \ExpA{\Exp{y \mid \theta}}{\theta}\)

  • \(\Var{y} = \ExpA{\Var{y \mid \theta}}{\theta} + \VarA{\Exp{y \mid \theta}}{\theta}\) (worked out below)
  • Can be useful to reparameterize in terms of mean and dispersion parameter
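
Carrying out these calculations with the standard Gamma moments \(\Exp{\theta} = \alpha/\beta\) and \(\Var{\theta} = \alpha/\beta^2\):

\[ \begin{aligned} \Exp{y} & = \ExpA{\theta}{\theta} = \frac{\alpha}{\beta} \\ \Var{y} & = \ExpA{\theta}{\theta} + \VarA{\theta}{\theta} = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = \frac{\alpha}{\beta}\lp 1 + \frac{1}{\beta}\rp \end{aligned} \]

The variance exceeds the mean, so the negative binomial is overdispersed relative to the Poisson.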

Poisson with rate/exposure parameterization

  • Many times it is useful to specify the model for count data \(y_i\) as conditional on an unknown rate and a known exposure term \(x_i\) \[ y_i \sim \text{Poisson}(x_i \theta) \]

  • We can see that this model is no longer exchangeable in \(y_i\), but is exchangeable in \((y_i, x_i)\)

  • Often useful in epidemiological settings, where we want to model the rate of, say, cancer cases per person per year

  • Suppose we observe \(3\) deaths from asthma over one year in a city of \(200{,}000\)

  • We want to model the rate of deaths from asthma per \(100{,}000\)

  • With exposure \(x_1 = 2\) (two hundred-thousands of person-years): \(y_1 \sim \text{Poisson}(2\, \theta) \rightarrow \theta \mid y_1 \sim \text{Gamma}(\alpha + 3, \beta + 2)\)
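
In R (a sketch; the prior values alpha = 3 and beta = 5 are illustrative placeholders, not from the lecture):

alpha <- 3; beta <- 5   # illustrative prior: theta ~ Gamma(alpha, beta)
y <- 3; x <- 2          # 3 deaths, exposure of 2 hundred-thousand person-years
alpha_post <- alpha + y
beta_post <- beta + x
alpha_post / beta_post  # posterior mean rate per 100,000
qgamma(c(0.025, 0.975), shape = alpha_post, rate = beta_post)  # 95% interval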

Scale of the prior

Normal with known mean

  • Though knowing \(\theta\) is not typical of real-world scenarios, under \(y_i \mid \theta, \sigma^2 \sim \text{Normal}(\theta, \sigma^2)\) we can derive the posterior for \(\sigma^2\) when \(\theta\) is assumed known

  • The likelihood corresponding to the joint distribution of \(Y_1, Y_2, \dots, Y_n\) is \[ p(y \mid \sigma^2, \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \lp -\frac{1}{2\sigma^2} (y_i - \theta)^2 \rp \]

  • Simplifying and dropping constants \[ p(y \mid \sigma^2, \theta) \propto (\sigma^2)^{-n/2}\exp \lp -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \theta)^2 \rp \]

  • Let’s represent \(\sum_{i=1}^n (y_i - \theta)^2\) as \(n v\), so we rewrite this as: \[ p(y \mid \sigma^2, \theta) \propto (\sigma^2)^{-n/2}\exp \lp -\frac{n}{2\sigma^2} v \rp \]

  • The conjugate prior here is called the inverse gamma, \[ p(\sigma^2 \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-(\alpha+1)} e^{-\beta/\sigma^2} \]

  • The posterior is \[ p(\sigma^2 \mid y, \theta, \alpha, \beta) \propto (\sigma^2)^{-(n/2 + \alpha+1)} \exp \lp -\lp\frac{n}{2} v + \beta\rp/\sigma^2\rp \] i.e., \(\sigma^2 \mid y, \theta \sim \text{Inv-Gamma}\lp\alpha + \frac{n}{2}, \beta + \frac{n v}{2}\rp\)
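
R has no built-in inverse gamma functions, but we can sample \(\sigma^2\) by inverting gamma draws (a sketch with simulated data; the prior values and true parameters are illustrative):

set.seed(3)
theta <- 0; alpha <- 2; beta <- 1        # known mean, illustrative prior
y <- rnorm(50, mean = theta, sd = 1.5)   # simulated data, true sigma = 1.5
n <- length(y)
v <- sum((y - theta)^2) / n
# If X ~ Gamma(shape, rate), then 1/X ~ Inv-Gamma(shape, rate), so:
sigma2_draws <- 1 / rgamma(1e4, shape = alpha + n / 2, rate = beta + n * v / 2)
quantile(sigma2_draws, c(0.025, 0.5, 0.975))  # should bracket 1.5^2 = 2.25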

References

Casella, George, and Roger Berger. 2024. Statistical Inference. Chapman & Hall/CRC.