Hamiltonian Monte Carlo

Originally for quantum-chromo-dynamic simulation (Duane et al., 1987)
Radford Neal started using for Bayesian neural networks in 1990’s
The performance scales well with the number of dimensions
Hoffman and Gelman’s (2014) NUTS variant and step size adaptation made it more robust wrt the algorithm parameters and thus easier to use
Stan was the first probabilistic programming framework using HMC+NUTS

Hamiltonian Monte Carlo

Uses log density (negative log density is called energy)
Uses gradient of log density for more efficient sampling

Hamiltonian Monte Carlo / No-U-Turn sampling

HMC basics (static HMC)
HMC + leapfrog discretization + Metropolis (static HMC)
- Duane et al. (1987)
NUTS + slice sampling + Metropolis (dynamic HMC)
- Hoffman & Gelman et al. (2014)
NUTS + multinomial (dynamic HMC)
- Betancourt (2018)

Hamiltonian Monte Carlo - Two sampling steps

Sample from \(p(\phi)\)
- Define \(p(\phi) = \text{Normal}(0,1)\)
Metropolis update for \(p(\theta,\phi)=p(\theta)p(\phi)\)
- Proposal from Hamiltonian dynamic simulation

Hamiltonian dynamic simulation

Joint distribution in terms of “potential energy” \(U(\theta)\) and “kinetic energy” \(K(\phi)\) \[ \begin{aligned} p(\theta,\phi)& =p(\theta)p(\phi) \\ & =\frac{1}{Z}\exp(-(U(\theta)+K(\phi))) \\ & =\frac{1}{Z}\exp(-H(\theta,\phi)) \end{aligned} \]

where:

\(U\) is potential energy function
\(K\) is kinetic energy function
\(H\) is Hamiltonian energy function
\(\phi\) is called a momentum variable
\[U(\theta)=-\log(p(\theta))+C\]
\[K(\phi)=-\log(p(\phi))+C\]

Simple dynamic system

Imagine a hockey puck with mass \(m\) on an ice rink connected to the boards by a spring
It can move only in one dimension; let’s call it’s position relative to the board \(\theta\)
If we push the puck towards the board, compressing the spring, we increase the potential energy of the puck
Letting go of the puck, the potential energy turns into kinetic energy over time by changing the momentum, \(\phi\) of the puck
In this system, the potential energy is \(\frac{1}{2c} \theta^2\), while the kinetic energy is \(\frac{1}{2m}\phi^2\)

Hamilton’s equations

We want to find the equations of motion for the system, i.e. \(\theta(t)\) and \(\phi(t)\)
Hamilton’s equations gives us a recipe to do so: \[ \begin{aligned} \frac{d\theta_i}{dt} & = \frac{\partial H}{\partial \phi_i}\\ \frac{d\phi_i}{dt} & = -\frac{\partial H}{\partial \theta_i} \end{aligned} \]
\(U(\theta) + K(\phi) = \frac{1}{2c} \theta^2 + \frac{1}{2m} \phi^2\) \[ \begin{aligned} \frac{d\theta}{dt} & = \phi / m \\ \frac{d\phi}{dt} & = -\theta / c \\ \end{aligned} \]
This is a system of ODEs with the following solution \[ \begin{aligned} \theta(t) & = r \cos(\frac{t}{\sqrt{mc}} + a) \\ \phi(t) & = - r\sqrt{m/c} \sin(\frac{t}{\sqrt{mc}} + a)\\ \end{aligned} \]
The total energy, or the Hamiltonian, has the form \[ \begin{aligned} U(\theta) + K(\phi) & = \frac{1}{2c} \theta^2 + \frac{1}{2m} \phi^2 \\ & = \frac{1}{2c} r^2 \cos^2(\frac{t}{\sqrt{mc}} + a) + \frac{1}{2m}\frac{m}{c} r^2 \sin^2(\frac{t}{\sqrt{mc}} + a) \\ & = \frac{r^2}{2c}(\cos^2(\frac{t}{\sqrt{mc}} + a) + \sin^2(\frac{t}{\sqrt{mc}} + a))\\ & = r^2 / 2c \end{aligned} \]

Visualization of trajectories

Energy plots

Hamiltonian Monte Carlo - Complete algorithm

Two sampling steps:

Sample from \(p(\phi)\)
- Define \(p(\phi) = \mathcal{N}(0,1)\)
Metropolis update for \(p(\theta,\phi)=p(\theta)p(\phi)\)
- Proposal from Hamiltonian dynamic simulation: \(p(\theta,\phi) \propto \exp(-H(\theta,\phi))\)

Recap from last week

Hamiltonian dynamics can generate perfect proposal distributions for \(\theta_s(t) \mid \theta_{s-1}(0)\)
Good properties of Hamiltonian proposals require tuning the time horizon \(t\) over which Hamilton’s equations will be solved
An optimal \(t\) will be such that \(\text{Cov}(\theta_s(t), \theta_{s-k}(0)) \approx 0\) for all \(k = 1, \dots \infty\)
Integrating Hamilton’s equations can’t be done analytically for most problems, so we resort to numerical integration
The leapfrog integrator is the standard integrator because it has good properties:
- Preserves volume
- Reversible
- Discretization error does not usually grow in time
These properties hold as long as the step-size \(\epsilon\) is chosen appropriately
Under numerical integration, choosing a \(t\) is equivalent to choosing \(L = t / \epsilon\), or the number of integrator steps needed to get from \(\theta_{s-1}(0)\) to \(\theta_s(t)\)
We need to choose \(t\) and \(\epsilon\)

Leapfrog discretization + Metropolis

Leapfrog discretization
- Due to the discretization error the simulation steps away from the constant contour
Metropolis step (with acceptance probability based on): \[r=\exp\left(-H(\theta^*,\phi^*)+H(\theta^{(t-1)},\phi^{(t-1)})\right)\]
- Accept if the Hamiltonian energy in the end is higher
- Accept with some probability if the Hamiltonian energy in the end is lower

Two steps of Hamiltonian Monte Carlo

Perfect simulation keeps \(p(\theta,\phi)\) constant
Discretized simulation keeps changes in \(p(\theta,\phi)\) small
Alternating sampling from \(p(\phi)\) is crucial for moving to \((\theta,\phi)\) points with different joint density

Leapfrog discretization - Step size

Trade-offs:

Small step size → high acceptance rate, but many log density and gradient evaluations
Big step size → less log density and gradient evaluations, but lower acceptance rate and the simulation may diverge

Leapfrog discretization - Number of steps

Many steps can reduce random walk
Many steps require many log density and gradient evaluations

Static Hamiltonian Monte Carlo

Fixed number of steps
Demo: https://chi-feng.github.io/mcmc-demo/

No-U-Turn sampler

Adaptively selects number of steps
- NUTS is a dynamic HMC algorithm, where dynamic refers to the dynamic trajectory length
- Simulate until a U-turn is detected
- The number of simulation steps doubled if no U-turn yet
To keep reversibility of Markov chain
- Need to simulate in two directions
- Choose a point along the simulation path with slice sampling
- Metropolis acceptance step for the selected point
For further efficiency
- Simulation path parts further away from the starting point can have higher probability
- Max tree depth to keep computation in control
Demo: https://chi-feng.github.io/mcmc-demo/

No-U-Turn sampler with multinomial sampling

Original NUTS
- Choose a point along the simulation path with slice sampling
- Possibly with bigger weighting for further points
- Metropolis acceptance step for the selected point
- If the proposal is rejected the previous state is also the new state
NUTS with multinomial sampling
- Compute the probability of selecting a point and accepting it for all points
- Select the point with multinomial sampling
- More likely to accept a point that is not the previous one
Demo: https://chi-feng.github.io/mcmc-demo/

Mass matrix and the step size adaptation

Mass matrix refers to having different scaling for different parameters and optionally also rotation to reduce correlations
- Mass matrix is estimated during the adaptation phase of the warm-up
- Mass matrix is estimated using the draws so far
Step size
- Adjusted to be as big as possible while keeping discretization error in control (adapt_delta)
- Dual averaging demo: https://chi-feng.github.io/mcmc-demo/
After adaptation the algorithm parameters are fixed and some more iterations run to finish the warmup

Max tree depth diagnostic

NUTS specific diagnostic
- The dynamic simulation is build as a binary tree (from Hoffman & Gelman, 2014)
- Maximum simulation length, i.e. maximum number of steps, is capped to avoid very long waiting times in case of bad behavior
Indicates inefficiency in sampling leading to higher autocorrelations and lower ESS (\(S_{\text{eff}}\))
- Very low inefficiency can indicate problems that need to be addressed
- Moderate inefficiency doesn’t invalidate the result
Different parameterizations matter

Divergences (Example 1)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density (compared to the used step size)
- Indicates possibility of biased estimates

Demo: https://chi-feng.github.io/mcmc-demo/

Divergences (Example 2)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

Problematic example output:

variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
lsigma0   -1.0   -1.0  0.10 0.070  -1.2 -0.93   1.2      7.2      19.

Divergences (Example 3)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

Warning: 29 of 4000 (1.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.

Divergences (Example 4)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

Problematic parameterization:

variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
lsigma0   -1.9   -2.1  0.41  0.13  -2.3 -0.91   2.0      5.8      14.

Divergences (Example 5)

HMC specific: indicates that Hamiltonian dynamic simulation has problems with unexpected fast changes in log-density

With non-centered parameterization (NCP):

variable  mean median    sd   mad     q5   q95  rhat ess_bulk ess_tail
sigma0    0.27   0.24  0.19  0.20  0.023  0.64   1.0    1571.    1737.
lsigma0  -1.7   -1.4   1.1   0.87 -3.8   -0.45   1.0    1571.    1737.

Problematic distributions

Nonlinear dependencies
- Simple mass matrix scaling doesn’t help
Funnels
- Optimal step size depends on location
Multimodal
- Difficult to move from one mode to another
Long-tailed with non-finite variance and mean
- Efficiency of exploration is reduced
- Central limit theorem doesn’t hold for mean and variance

Some other recent HMC and gradient based variants

ChEES-HMC (Hoffman et al., 2021)
- A GPU friendly adapted but fixed simulation length
- Static after adaptation
MEADS (Hoffman & Sountsov, 2022)
- A GPU friendly multi-chain adaptation for generalized HMC (Horowitz, 1991)
- Instead of simulation length, need to choose the partial update rate
MALT (Riou-Durand and Vogrinc, 2022; Riou-Durand et al., 2022)
- A GPU friendly method related to GHMC
- Avoids momentum flips after rejection
WALNUTS (Bou-Rabee et al., 2025)
- Adaptive step size within dynamic simulation

Hamiltonian Monte Carlo

Main Points

BDA Chapter 12

Extra material for HMC / NUTS

Extra material for Stan

Chapter 12 demos

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo / No-U-Turn sampling

Hamiltonian Monte Carlo - Related methods

Hamiltonian Monte Carlo - Two sampling steps

Hamiltonian dynamic simulation

Simple dynamic system

Hamilton’s equations

Visualization of trajectories

Energy plots

Hamiltonian Monte Carlo - Complete algorithm

Recap from last week

Leapfrog discretization + Metropolis

Two steps of Hamiltonian Monte Carlo

Leapfrog discretization - Step size

Leapfrog discretization - Number of steps

Static Hamiltonian Monte Carlo

No-U-Turn sampler

No-U-Turn sampler with multinomial sampling

Mass matrix and the step size adaptation

Max tree depth diagnostic

Divergences (Example 1)

Divergences (Example 2)

Divergences (Example 3)

Divergences (Example 4)

Divergences (Example 5)

Problematic distributions

Some other recent HMC and gradient based variants