ST 559: Introduction

Rob Trangucci

“Classical” approach to statistics

  • Data \(Y\) is assumed to arise from a model \(f_Y(y \mid \theta)\)

  • Choose a class of estimators (unbiased, maximum likelihood, least squares, etc.)

  • Infer \(\hat{\theta}\) from \(Y\)

  • Calculate the sampling distribution \(\hat{\theta}\) assuming repeated draws from \(f_Y(y \mid \theta)\).

Issue 1: Consequences (losses)

  • On its face, classical statistics seems fine

  • However, as L.J. Savage puts it in Savage (1972):

  • […] the problems of statistics were almost always thought of as problems of deciding what to say rather than what to do

  • Namely, the end use (and end user) of an estimate isn’t considered

  • Consider two scenarios:

    • \(\theta\) represents the expected proportion of NYTimes app users who click on a banner ad (what is the consequence of underestimation for the NYT?)

    • \(\theta\) represents the expected proportion of cancer patients for whom a new treatment leads to remission (what is the consequence of underestimation for the researchers?)

Solutions to problem 1: Decision theory and loss functions

  • You’ve already seen some decision theory in ST 562

    • Let the set of actions one can take be \(\mathcal{A}\) and let the true state of nature be \(\theta\).

    • A loss function for a single action, \(a\), is \(L(a, \theta)\), which maps an action (or a decision) and the state of nature to a number representing the cost of the action

    • Examples: squared error: \(L(a,\theta) = (a - \theta)^2\), absolute error: \(L(a,\theta) = \abs{a - \theta}\)

Solutions to problem 1: More involved loss functions

  • Asymmetric loss:

    • \[ L(a,\theta) = \begin{cases} 10(a - \theta)^2 & a < \theta \\ (a - \theta)^2 & a \geq \theta \end{cases} \]
  • Linear loss under discrete actions \(a \in \{1, 2\}\)

    • \[ L(a, \theta) = \begin{cases} K_1 + k_1 \theta & a = 1 \\ K_2 + k_2 \theta & a = 2 \end{cases} \]

Issue 2: Importance of context in statistics

A (modified) example from L.J. Savage: Consider the following experiments in which we are trying to infer \(\theta\), the probability that a person will answer a question correctly:

  • A self-described coffee expert claims to be able to determine whether his iced latte was made with milk poured into the cup prior to the espresso shot or if the espresso shot went in first followed by the milk. In a randomized experiment, he correctly identifies the order 10 out of 10 times

  • An internationally renowned beer expert claims to be able to distinguish West Coast IPAs from New England IPAs. She correctly identifies the beers in a randomized study 10 out 10 times.

  • You are celebrating an OSU win over UO baseball at Squirrel’s when you meet a dejected Duck fan who swears he is clairvoyant and can predict the outcome of a fair coin flip. He correctly predicts 10 out of 10 flips.

Issue 2: Importance of context in statistics

  • Berger (2013) notes: each example in the previous slide would lead to an MLE for \(\theta = 1\), and a classical hypothesis test would yield a p-value of \(2^{-10}\) under the null \(\theta = 1/2\).

  • However, the weight we would give the evidence in each example would differ (How would we weigh the different examples?)

Bayesian inference

  • Bayesian inference is a way of weighing evidence from the likelihood against a prior distribution

  • A Bayesian probability model has two pieces

    • Likelihood: \(f_Y(y \mid \theta)\), or the sampling distribution for the data

    • Prior: \(p(\theta)\), which measures the beliefs about \(\theta\) prior to observing data \(Y\)

Bayesian inference cont’d

  • Bayesian inference uses Bayes’ rule to update the prior to a posterior:

\[ p(\theta \mid y) = \frac{f_Y(y \mid \theta) p(\theta)}{\int_{\Theta} f_Y(y \mid \theta) p(\theta) d\theta} \]

Sources of uncertainty

  • “Aleatoric” uncertainty means stochastic uncertainty; this is the uncertainty that frequentist statistics considers

    • Coin flips, measurement error, sampling uncertainty

    • For a given dataset, there is no uncertainty in frequentism

  • “Epistemic” uncertainty is uncertainty due to lack of knowledge: Bayesian inference adds epistemic uncertainty

    • The prior and the posterior track epistemic uncertainty

    • This is what allows us to make probability statements conditional on a given dataset

    • Different observers can have different knowledge of a problem

Example: Drawing colored poker chips from a bag

  • Probability of red \(\theta = \frac{\textcolor{red}{\text{\#red}}}{\textcolor{red}{\text{\#red}} + \textcolor{orange}{\text{\#orange}}}\)

  • \(Y_i\) is the color of a draw with replacement from the bag, \(p(Y_i = \textcolor{red}{\text{red}} \mid \theta)\): aleatoric uncertainty

  • \(p(\theta)\): epistemic uncertainty

  • Picking chips from the bag will change our uncertainty about the proportion

  • \(p(\theta \mid \textcolor{red}{\text{red}}, \textcolor{orange}{\text{orange}}, \textcolor{orange}{\text{orange}}, \textcolor{red}{\text{red}}, \dots) = ?\)

  • Use Bayes rule: \[ p(\theta \mid y) = \frac{f_Y(y \mid \theta) p(\theta)}{\int_{\Theta} f_Y(y \mid \theta) p(\theta) d\theta} \]

Fundamental understandings of probability

  • Frequentist probability is, as the name implies, defined by long-run frequencies of outcomes

    • This seems like a good mathematical model for simple outcomes like dice throws, or poker hands, but doesn’t make as much sense for modeling outcomes like the winner of the Michigan/Arizona Basketball game on Saturday
  • Bayesian probability, in contrast, measures degrees of belief, and can be constructed from a coherent betting strategy on the outcomes of events

Frequentist vs. Bayesian uses of the loss function

  • The Bayesian approach to using the loss function is straightforward:

    • Given a posterior \(p(\theta \mid y)\), for each action \(a\) compute \(\rho(a) = \ExpA{L(a, \theta)}{\theta \mid y}\)

    • Pick the action that minimizes the expected loss: \(a^* = \argmin_a \rho(a)\)

  • The Frequentist approach to using the loss function is more involved

Frequentist use of the loss function

  • There is no epistemic uncertainty in frequentist inference, so we can’t compute \(\rho(a)\).

  • Instead we need a way to choose an \(a\) given a dataset, \(Y\), which we’ll denote \(\delta_k(y)\).

  • Then we compute risk, \(R(\delta_k, \theta) = \ExpA{L(\delta_k(Y), \theta)}{Y \mid \theta}\)

  • We can then rank decision rules \(\delta_m\) vs. \(\delta_k\) by comparing \(R(\delta_k, \theta)\) and \(R(\delta_j, \theta)\)

  • For instance, we can pick a decision rule by minimizing the maximum risk: \(\argmin_{k} \sup_{\theta} R(\delta_k, \theta)\)

Key differences

  • Bayesian model building is separate from a loss function

  • Frequentist model building is dependent on the loss function

Hurdle to Bayesian inference: Computation

Calculating integrals in the denominator is hard

\[ p(\theta \mid y) = \frac{f_Y(y \mid \theta) p(\theta)}{\int_{\Theta} f_Y(y \mid \theta) p(\theta) d\theta} \]

Directly approximating \[\int_{\Theta} f_Y(y \mid \theta) p(\theta) d\theta,\] is hard for reasons we’ll get into later in the course

Monte Carlo approximation of integrals

  • An alternative is to approximate functionals of the posterior:

    • \(\ExpA{f(\theta)}{\theta \mid y} = \int_{\Theta} f(\theta) p(\theta \mid y) d\theta\)

    • If we can draw \(\theta^{(s)}, s = 1, \dots, S\) from \(p(\theta \mid y)\), then \[\ExpA{f(\theta)}{\theta \mid y} \approx \frac{1}{S} \sum_{s=1}^S f(\theta^{(s)})\] by the WLLN

  • But how do we draw \(\theta^{(s)} \sim p(\theta \mid y)\)?

Bayesian Articles over Time

Lynch and Bartlett (2019)

Lynch and Bartlett (2019)

Zooming in

  • What happened in 1990?

    • Gelfand and Smith (1990) - Gibbs sampling
  • What happened after 2005?

    • Plummer et al. (2003) - JAGS (Just-Another-Gibbs-Sampler), open source Gibbs Sampler

    • Gelman, Lee, and Guo (2015) - Stan, second-gen PPL and inference algorithms

Evolution of probabilistic programming languages: BUGS/JAGS

  • Bayesian inference Using Gibbs Sampling was one of the first probabilistic programming languages

  • Flexible syntax allows for definition of (nearly) arbitrary statistical models

  • Simulation algorithm can sample from (nearly) arbitrary posteriors

  • Hamstrung by reliance on Gibbs sampling, which doesn’t scale well to high dimensions

Evolution of probabilistic programming languages: Stan

  • 2011: Andrew Gelman’s post-doc Matt Hoffman proposes a modification to an existing MCMC algorithm which allows for sampling from (nearly) arbitrary posteriors

  • 2012: First version of Stan released

    • Flexible syntax allows for definition of arbitrary statistical models

    • Scales well to high dimensions due to the “magic” of Hamiltonian Monte Carlo

Stan programs

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0> sd_y;
  real mean_y;
}
model {
  y ~ normal(mean_y, sd_y);
  mean_y ~ student_t(4, 0, 10);
  sd_y ~ student_t(4, 0, 5);
}

Bayesian inference and computation

  • Bayesian inference inextricably linked with computation

  • Leaps forward in computation allow for more complex models

  • …which in turn creates more demand for better samplers

Model criticism and comparison

  • Given that Bayesian inference + modern computation allows for big models, how do we tell if they’re any good?

  • Posterior predictive checks \[ p(\tilde{y} \mid y) = \int_{\Theta} p(\tilde{y} \mid \theta) p(\theta \mid y) d\theta \]

  • Leave-one-out cross validation \[ p(\tilde{y}_i \mid y) = \int_{\Theta} p(\tilde{y}_i \mid \theta) p(\theta \mid y_{-i}) d\theta \]

  • How do we set priors?

  • Prior predictive checks \[ p(\tilde{y}) = \int_{\Theta} p(\tilde{y} \mid \theta) p(\theta) d\theta \]

To-do for Thursday

  • Read Chapter 1 in Bayesian Data Analysis

  • Refresh knowledge of

    • Probability

    • PDFs, PMFs, CDFs

    • Likelihood, Bayes rule

References

Berger, James O. 2013. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media.
Gelfand, Alan E., and Adrian F. M. Smith. 1990. “Sampling-Based Approaches to Calculating Marginal Densities.” Journal of the American Statistical Association 85 (410): 398–409. https://doi.org/10.2307/2289776.
Gelman, Andrew, Daniel Lee, and Jiqiang Guo. 2015. “Stan: A Probabilistic Programming Language for Bayesian Inference and Optimization.” Journal of Educational and Behavioral Statistics 40 (5): 530–43.
Plummer, Martyn et al. 2003. “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124:1–10. 125.10. Vienna, Austria.
Savage, Leonard J. 1972. The Foundations of Statistics. 2d rev. ed. New York: Dover Publications.