Hamiltonian Monte Carlo

Main Points

  • Hamiltonian Monte Carlo uses gradient information and dynamic simulation to reduce random-walk and increase acceptance rate
    • The performance scales well with the number of dimensions
    • This lecture introduces the basic HMC and No-U-Turn-Sampler based dynamic HMC
    • Other useful variants have been developed recently
  • Stan is the most popular probabilistic programming framework
    • Many recent probprog frameworks use dynamic HMC samplers
    • This lecture introduces Stan language and main features
    • Later you can also use higher level packages built on top of Stan

BDA Chapter 12

  • 12.1 Efficient Gibbs samplers (not part of the course)
  • 12.2 Efficient Metropolis jump rules (not part of the course)
  • 12.3 Further extensions to Gibbs and Metropolis (not part of the course)
  • 12.4 Hamiltonian Monte Carlo (important)
  • 12.5 Hamiltonian dynamics for a simple hierarchical model (useful example)
  • 12.6 Stan: developing a computing environment (useful intro)

Extra material for HMC / NUTS

Extra material for Stan

Chapter 12 demos

Hamiltonian Monte Carlo

  • Originally for quantum-chromo-dynamic simulation (Duane et al., 1987)
  • Radford Neal started using for Bayesian neural networks in 1990’s
  • The performance scales well with the number of dimensions
  • Hoffman and Gelman’s (2014) NUTS variant and step size adaptation made it more robust wrt the algorithm parameters and thus easier to use
  • Stan was the first probabilistic programming framework using HMC+NUTS

Hamiltonian Monte Carlo

  • Uses log density (negative log density is called energy)
  • Uses gradient of log density for more efficient sampling

Hamiltonian Monte Carlo / No-U-Turn sampling

  1. HMC basics (static HMC)

  2. HMC + leapfrog discretization + Metropolis (static HMC)

    • Duane et al. (1987)
  3. NUTS + slice sampling + Metropolis (dynamic HMC)

    • Hoffman & Gelman et al. (2014)
  4. NUTS + multinomial (dynamic HMC)

    • Betancourt (2018)

Hamiltonian Monte Carlo - Two sampling steps

  1. Sample from \(p(\phi)\)
    • Define \(p(\phi) = \text{Normal}(0,1)\)
  2. Metropolis update for \(p(\theta,\phi)=p(\theta)p(\phi)\)
    • Proposal from Hamiltonian dynamic simulation

Hamiltonian dynamic simulation

  • Joint distribution in terms of “potential energy” \(U(\theta)\) and “kinetic energy” \(K(\phi)\) \[ \begin{aligned} p(\theta,\phi)& =p(\theta)p(\phi) \\ & =\frac{1}{Z}\exp(-(U(\theta)+K(\phi))) \\ & =\frac{1}{Z}\exp(-H(\theta,\phi)) \end{aligned} \]

where:

  • \(U\) is potential energy function

  • \(K\) is kinetic energy function

  • \(H\) is Hamiltonian energy function

  • \(\phi\) is called a momentum variable

  • \[U(\theta)=-\log(p(\theta))+C\]

  • \[K(\phi)=-\log(p(\phi))+C\]

Simple dynamic system

  • Imagine a hockey puck with mass \(m\) on an ice rink connected to the boards by a spring

  • It can move only in one dimension; let’s call it’s position relative to the board \(\theta\)

  • If we push the puck towards the board, compressing the spring, we increase the potential energy of the puck

  • Letting go of the puck, the potential energy turns into kinetic energy over time by changing the momentum, \(\phi\) of the puck

  • In this system, the potential energy is \(\frac{1}{2c} \theta^2\), while the kinetic energy is \(\frac{1}{2m}\phi^2\)

Hamilton’s equations

  • We want to find the equations of motion for the system, i.e. \(\theta(t)\) and \(\phi(t)\)

  • Hamilton’s equations gives us a recipe to do so: \[ \begin{aligned} \frac{d\theta_i}{dt} & = \frac{\partial H}{\partial \phi_i}\\ \frac{d\phi_i}{dt} & = -\frac{\partial H}{\partial \theta_i} \end{aligned} \]

  • \(U(\theta) + K(\phi) = \frac{1}{2c} \theta^2 + \frac{1}{2m} \phi^2\) \[ \begin{aligned} \frac{d\theta}{dt} & = \phi / m \\ \frac{d\phi}{dt} & = -\theta / c \\ \end{aligned} \]

  • This is a system of ODEs with the following solution \[ \begin{aligned} \theta(t) & = r \cos(\frac{t}{\sqrt{mc}} + a) \\ \phi(t) & = - r\sqrt{m/c} \sin(\frac{t}{\sqrt{mc}} + a)\\ \end{aligned} \]

  • The total energy, or the Hamiltonian, has the form \[ \begin{aligned} U(\theta) + K(\phi) & = \frac{1}{2c} \theta^2 + \frac{1}{2m} \phi^2 \\ & = \frac{1}{2c} r^2 \cos^2(\frac{t}{\sqrt{mc}} + a) + \frac{1}{2m}\frac{m}{c} r^2 \sin^2(\frac{t}{\sqrt{mc}} + a) \\ & = \frac{r^2}{2c}(\cos^2(\frac{t}{\sqrt{mc}} + a) + \sin^2(\frac{t}{\sqrt{mc}} + a))\\ & = r^2 / 2c \end{aligned} \]

Visualization of trajectories

Energy plots

Hamiltonian Monte Carlo - Complete algorithm

Two sampling steps:

  1. Sample from \(p(\phi)\)
    • Define \(p(\phi) = \mathcal{N}(0,1)\)
  2. Metropolis update for \(p(\theta,\phi)=p(\theta)p(\phi)\)
    • Proposal from Hamiltonian dynamic simulation: \(p(\theta,\phi) \propto \exp(-H(\theta,\phi))\)

Recap from last week

  • Hamiltonian dynamics can generate perfect proposal distributions for \(\theta_s(t) \mid \theta_{s-1}(0)\)

  • Good properties of Hamiltonian proposals require tuning the time horizon \(t\) over which Hamilton’s equations will be solved

  • An optimal \(t\) will be such that \(\text{Cov}(\theta_s(t), \theta_{s-k}(0)) \approx 0\) for all \(k = 1, \dots \infty\)

  • Integrating Hamilton’s equations can’t be done analytically for most problems, so we resort to numerical integration

  • The leapfrog integrator is the standard integrator because it has good properties:

    • Preserves volume

    • Reversible

    • Discretization error does not usually grow in time

  • These properties hold as long as the step-size \(\epsilon\) is chosen appropriately

  • Under numerical integration, choosing a \(t\) is equivalent to choosing \(L = t / \epsilon\), or the number of integrator steps needed to get from \(\theta_{s-1}(0)\) to \(\theta_s(t)\)

  • We need to choose \(t\) and \(\epsilon\)

Leapfrog discretization + Metropolis

  • Leapfrog discretization

    • Due to the discretization error the simulation steps away from the constant contour
  • Metropolis step (with acceptance probability based on): \[r=\exp\left(-H(\theta^*,\phi^*)+H(\theta^{(t-1)},\phi^{(t-1)})\right)\]

    • Accept if the Hamiltonian energy in the end is higher
    • Accept with some probability if the Hamiltonian energy in the end is lower

Two steps of Hamiltonian Monte Carlo

  • Perfect simulation keeps \(p(\theta,\phi)\) constant

  • Discretized simulation keeps changes in \(p(\theta,\phi)\) small

  • Alternating sampling from \(p(\phi)\) is crucial for moving to \((\theta,\phi)\) points with different joint density

Leapfrog discretization - Step size

Trade-offs:

  • Small step size → high acceptance rate, but many log density and gradient evaluations
  • Big step size → less log density and gradient evaluations, but lower acceptance rate and the simulation may diverge

Leapfrog discretization - Number of steps

  • Many steps can reduce random walk
  • Many steps require many log density and gradient evaluations

Static Hamiltonian Monte Carlo

No-U-Turn sampler

  • Adaptively selects number of steps
    • NUTS is a dynamic HMC algorithm, where dynamic refers to the dynamic trajectory length
    • Simulate until a U-turn is detected
    • The number of simulation steps doubled if no U-turn yet
  • To keep reversibility of Markov chain
    • Need to simulate in two directions
    • Choose a point along the simulation path with slice sampling
    • Metropolis acceptance step for the selected point
  • For further efficiency
    • Simulation path parts further away from the starting point can have higher probability
    • Max tree depth to keep computation in control
  • Demo: https://chi-feng.github.io/mcmc-demo/

No-U-Turn sampler with multinomial sampling

  • Original NUTS
    • Choose a point along the simulation path with slice sampling
    • Possibly with bigger weighting for further points
    • Metropolis acceptance step for the selected point
    • If the proposal is rejected the previous state is also the new state
  • NUTS with multinomial sampling
    • Compute the probability of selecting a point and accepting it for all points
    • Select the point with multinomial sampling
    • More likely to accept a point that is not the previous one
  • Demo: https://chi-feng.github.io/mcmc-demo/

Mass matrix and the step size adaptation

  • Mass matrix refers to having different scaling for different parameters and optionally also rotation to reduce correlations
    • Mass matrix is estimated during the adaptation phase of the warm-up
    • Mass matrix is estimated using the draws so far
  • Step size
  • After adaptation the algorithm parameters are fixed and some more iterations run to finish the warmup

Max tree depth diagnostic

  • NUTS specific diagnostic
    • The dynamic simulation is build as a binary tree (from Hoffman & Gelman, 2014)
    • Maximum simulation length, i.e. maximum number of steps, is capped to avoid very long waiting times in case of bad behavior
  • Indicates inefficiency in sampling leading to higher autocorrelations and lower ESS (\(S_{\text{eff}}\))
    • Very low inefficiency can indicate problems that need to be addressed
    • Moderate inefficiency doesn’t invalidate the result
  • Different parameterizations matter

Divergences (Example 1)

  • HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density (compared to the used step size)
    • Indicates possibility of biased estimates

Demo: https://chi-feng.github.io/mcmc-demo/

Divergences (Example 2)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

Problematic example output:

variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
lsigma0   -1.0   -1.0  0.10 0.070  -1.2 -0.93   1.2      7.2      19.

Divergences (Example 3)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

Warning: 29 of 4000 (1.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

Divergences (Example 4)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

Problematic parameterization:

variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
lsigma0   -1.9   -2.1  0.41  0.13  -2.3 -0.91   2.0      5.8      14.

Divergences (Example 5)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

With non-centered parameterization (NCP):

variable  mean median    sd   mad     q5   q95  rhat ess_bulk ess_tail
sigma0    0.27   0.24  0.19  0.20  0.023  0.64   1.0    1571.    1737.
lsigma0  -1.7   -1.4   1.1   0.87 -3.8   -0.45   1.0    1571.    1737.

Problematic distributions

  • Nonlinear dependencies
    • Simple mass matrix scaling doesn’t help
  • Funnels
    • Optimal step size depends on location
  • Multimodal
    • Difficult to move from one mode to another
  • Long-tailed with non-finite variance and mean
    • Efficiency of exploration is reduced
    • Central limit theorem doesn’t hold for mean and variance

Some other recent HMC and gradient based variants

  • ChEES-HMC (Hoffman et al., 2021)
    • A GPU friendly adapted but fixed simulation length
    • Static after adaptation
  • MEADS (Hoffman & Sountsov, 2022)
    • A GPU friendly multi-chain adaptation for generalized HMC (Horowitz, 1991)
    • Instead of simulation length, need to choose the partial update rate
  • MALT (Riou-Durand and Vogrinc, 2022; Riou-Durand et al., 2022)
    • A GPU friendly method related to GHMC
    • Avoids momentum flips after rejection
  • WALNUTS (Bou-Rabee et al., 2025)
    • Adaptive step size within dynamic simulation