U-turn Chains: Theory and Applications

Updated 4 July 2026

U-turn chains are constructions where a trajectory reverses its direction, used in HMC, diffusion models, and stochastic vertex models to manage retracing and exploration.
In Hamiltonian Monte Carlo, the No-U-Turn Sampler (NUTS) adaptively terminates trajectories while preserving detailed balance, and has been reported to improve effective sample size per gradient evaluation by about a factor of 3 over the best-tuned HMC on some targets.
Extensions to diffusion and vertex models leverage U-turn concepts to generate manifold-respecting proposals and design reflective boundary rules through integrable algebraic formulations.

“U-turn chains” denotes several technically distinct constructions in which a trajectory is either terminated at the onset of a turn back toward previously visited states, or is explicitly pushed forward and then reversed. In Hamiltonian Monte Carlo, the term refers to Markov chains whose proposals are built from Hamiltonian trajectories that eventually turn around and start heading back toward where they started; the No-U-Turn Sampler (NUTS) was introduced to stop before this retracing regime and thereby eliminate manual tuning of the trajectory length (Hoffman et al., 2011). In recent diffusion-model work, “U-turn chains” has also been used for Markov chains obtained by iterating short forward-backward diffusion steps on a learned data manifold (Kang et al., 26 May 2026). A separate usage appears in colored stochastic vertex models with a U-turn right boundary, where paths reflect at the boundary and may change sign (Zhong, 2024).

1. Core notion of a U-turn

In the Hamiltonian setting, a U-turn is the onset of motion back toward the starting point of a trajectory. For a Hamiltonian path with starting position $\theta$, current position $\tilde\theta(t)$, and momentum $\tilde r(t)$, NUTS formalizes “moving away” versus “moving back” through

$\frac{d}{dt}\frac12\|\tilde\theta(t)-\theta\|^2 = (\tilde\theta(t)-\theta)\cdot \tilde r(t).$

If $(\tilde\theta-\theta)\cdot \tilde r>0$, the distance from the start is increasing; if $(\tilde\theta-\theta)\cdot \tilde r<0$, the distance from the start is decreasing and the trajectory has begun to turn back (Hoffman et al., 2011).

NUTS does not use the naive stopping rule “stop when $(\tilde\theta-\theta)\cdot \tilde r<0$,” because that rule breaks time reversibility and detailed balance if used directly as an HMC update. Instead it applies a subtree-based criterion. If the current tree has leftmost state $(\theta^-,r^-)$ and rightmost state $(\theta^+,r^+)$, the U-turn condition is

$(\theta^+ - \theta^-)\cdot r^- < 0 \quad \text{or} \quad (\theta^+ - \theta^-)\cdot r^+ < 0.$
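The subtree condition above is a pair of dot products, which the following minimal sketch makes concrete. The function names `dot` and `is_u_turn` are illustrative choices, not taken from any library, and positions and momenta are plain Python lists.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def is_u_turn(theta_minus, r_minus, theta_plus, r_plus):
    """Return True if the span between the two tree ends has begun to shrink.

    theta_minus, r_minus: leftmost position and momentum of the (sub)tree.
    theta_plus, r_plus:   rightmost position and momentum of the (sub)tree.
    """
    span = [p - m for p, m in zip(theta_plus, theta_minus)]
    return dot(span, r_minus) < 0 or dot(span, r_plus) < 0


# Both ends still moving apart along the first axis: no U-turn.
print(is_u_turn([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))
# The right end has reversed its momentum: U-turn detected.
print(is_u_turn([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]))
```

The check is symmetric in the two ends of the trajectory, which is what allows it to be applied to subtrees built in either time direction.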

In Euclidean HMC this can also be expressed through the inner product between the current momentum and a trajectory-averaged momentum, and in the Riemannian setting that inner product is replaced by a metric-dependent one with transported momenta (Betancourt, 2013).

2. No-U-Turn sampling in Hamiltonian Monte Carlo

Standard HMC targets a density

$p(\theta) \propto \exp\{\mathcal{L}(\theta)\},$

augments $\theta$ with momentum $r$, and uses Hamiltonian

$H(\theta,r) = -\mathcal{L}(\theta) + \tfrac12\, r\cdot r.$

Continuous dynamics are approximated by leapfrog integration with step size $\epsilon$. After $L$ leapfrog steps, HMC applies a Metropolis correction. Its central tuning problem is the integration time $\epsilon L$: if $\epsilon L$ is too small the chain reverts toward random-walk behavior, and if $\epsilon L$ is too large the trajectory loops and U-turns, so extra computation retraces the path rather than enlarging exploration (Hoffman et al., 2011).
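The fixed-length HMC update described above can be sketched in a few lines. This is a minimal illustration assuming a unit mass matrix and user-supplied `log_p` and `grad_log_p` callables; the names `leapfrog` and `hmc_step` are hypothetical, not a library API.

```python
import math
import random


def leapfrog(theta, r, grad_log_p, eps):
    """One leapfrog step for H(theta, r) = -log_p(theta) + 0.5 * r.r."""
    g = grad_log_p(theta)
    r = [ri + 0.5 * eps * gi for ri, gi in zip(r, g)]    # half step in momentum
    theta = [ti + eps * ri for ti, ri in zip(theta, r)]  # full step in position
    g = grad_log_p(theta)
    r = [ri + 0.5 * eps * gi for ri, gi in zip(r, g)]    # half step in momentum
    return theta, r


def hmc_step(theta, log_p, grad_log_p, eps, n_steps, rng):
    """One HMC transition: resample momentum, integrate, Metropolis-correct."""
    r0 = [rng.gauss(0.0, 1.0) for _ in theta]
    new_theta, r = list(theta), list(r0)
    for _ in range(n_steps):
        new_theta, r = leapfrog(new_theta, r, grad_log_p, eps)
    log_joint_old = log_p(theta) - 0.5 * sum(x * x for x in r0)
    log_joint_new = log_p(new_theta) - 0.5 * sum(x * x for x in r)
    accept_prob = math.exp(min(0.0, log_joint_new - log_joint_old))
    return new_theta if rng.random() < accept_prob else list(theta)


# Usage on a one-dimensional standard normal target (toy assumption).
rng = random.Random(0)
theta = [0.0]
draws = []
for _ in range(1000):
    theta = hmc_step(theta, lambda th: -0.5 * th[0] ** 2,
                     lambda th: [-th[0]], eps=0.25, n_steps=6, rng=rng)
    draws.append(theta[0])
```

Here the product `eps * n_steps` is the integration time that the user must tune by hand, which is the quantity NUTS chooses adaptively.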

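NUTS removes the user-specified number of steps $L$. It introduces a slice variable

$u \sim \mathrm{Uniform}\big(\big[0,\ \exp\{\mathcal{L}(\theta) - \tfrac12\, r\cdot r\}\big]\big),$

then builds a binary tree of states by repeatedly doubling the trajectory length forward or backward in time. At each doubling it uses the subtree U-turn test and an energy-error stop condition with a large tolerance $\Delta_{\max}$,

$\mathcal{L}(\theta) - \tfrac12\, r\cdot r - \log u < -\Delta_{\max},$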

and it samples from the valid states in the constructed tree by weighted random choice. Because the candidate set is built symmetrically and only slice-consistent states are retained, the resulting transition preserves the target distribution.
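The doubling procedure can be sketched as follows. This is a simplified illustration in the spirit of the naive (non-memory-efficient) variant, assuming a unit mass matrix; it stores every slice-consistent state and picks one uniformly at the end. The names `build_tree`, `nuts_step`, `DELTA_MAX`, and `max_depth` are illustrative assumptions, not a reference implementation.

```python
import math
import random

DELTA_MAX = 1000.0  # energy-error tolerance (assumed value)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leapfrog(theta, r, grad_log_p, eps):
    g = grad_log_p(theta)
    r = [ri + 0.5 * eps * gi for ri, gi in zip(r, g)]
    theta = [ti + eps * ri for ti, ri in zip(theta, r)]
    g = grad_log_p(theta)
    r = [ri + 0.5 * eps * gi for ri, gi in zip(r, g)]
    return theta, r


def no_u_turn(th_m, r_m, th_p, r_p):
    span = [p - m for p, m in zip(th_p, th_m)]
    return dot(span, r_m) >= 0 and dot(span, r_p) >= 0


def build_tree(theta, r, log_u, v, depth, eps, log_p, grad_log_p):
    """Build a subtree of 2**depth leapfrog steps in direction v (+1 or -1)."""
    if depth == 0:
        th, rr = leapfrog(theta, r, grad_log_p, v * eps)
        log_joint = log_p(th) - 0.5 * dot(rr, rr)
        valid = [th] if log_u <= log_joint else []   # slice-consistent states
        ok = log_joint > log_u - DELTA_MAX           # energy-error check
        return th, rr, th, rr, valid, ok
    th_m, r_m, th_p, r_p, valid, ok = build_tree(
        theta, r, log_u, v, depth - 1, eps, log_p, grad_log_p)
    if ok:  # only extend if the first half did not trigger a stop
        if v == -1:
            th_m, r_m, _, _, valid2, ok2 = build_tree(
                th_m, r_m, log_u, v, depth - 1, eps, log_p, grad_log_p)
        else:
            _, _, th_p, r_p, valid2, ok2 = build_tree(
                th_p, r_p, log_u, v, depth - 1, eps, log_p, grad_log_p)
        ok = ok2 and no_u_turn(th_m, r_m, th_p, r_p)
        valid = valid + valid2
    return th_m, r_m, th_p, r_p, valid, ok


def nuts_step(theta, log_p, grad_log_p, eps, rng, max_depth=10):
    r0 = [rng.gauss(0.0, 1.0) for _ in theta]
    # log of the slice variable u ~ Uniform(0, exp(log_p - 0.5 r.r)]
    log_u = log_p(theta) - 0.5 * dot(r0, r0) + math.log(1.0 - rng.random())
    th_m, r_m, th_p, r_p = list(theta), r0, list(theta), r0
    candidates = [list(theta)]
    depth, ok = 0, True
    while ok and depth < max_depth:
        v = rng.choice((-1, 1))  # double backward or forward in time
        if v == -1:
            th_m, r_m, _, _, new, ok_new = build_tree(
                th_m, r_m, log_u, v, depth, eps, log_p, grad_log_p)
        else:
            _, _, th_p, r_p, new, ok_new = build_tree(
                th_p, r_p, log_u, v, depth, eps, log_p, grad_log_p)
        if ok_new:
            candidates += new  # keep the new half only if it is stop-free
        ok = ok_new and no_u_turn(th_m, r_m, th_p, r_p)
        depth += 1
    return rng.choice(candidates)
```

Storing the full candidate list costs memory proportional to the trajectory length; it is kept here only because it makes the sampling step easy to read.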

NUTS still requires a step size $\epsilon$, but Hoffman and Gelman derive a dual-averaging adaptation scheme for $\epsilon$ during burn-in. In their experiments, $\gamma = 0.05$, $t_0 = 10$, and $\kappa = 0.75$, with $\mu = \log(10\epsilon_1)$ for initial step size $\epsilon_1$, and a helper routine $\mathrm{FindReasonableEpsilon}$ searches for an initial $\epsilon$. The target acceptance rate is around $\delta = 0.65$ for HMC; for NUTS the experiments show good performance around $\delta = 0.6$ and relative insensitivity in the range $0.45$–$0.65$.
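A dual-averaging step-size update of this kind can be sketched as below. The update equations follow the standard dual-averaging recursion with shrinkage toward $\mu$; the class name `DualAveraging`, its attribute names, and the synthetic acceptance curve used in the demo are assumptions made for illustration, and the default constants are illustrative settings.

```python
import math


class DualAveraging:
    """Adapt log(step size) so the mean acceptance statistic approaches delta."""

    def __init__(self, eps0, delta=0.65, gamma=0.05, t0=10.0, kappa=0.75):
        self.mu = math.log(10.0 * eps0)  # shrinkage point for log(eps)
        self.delta, self.gamma, self.t0, self.kappa = delta, gamma, t0, kappa
        self.h_bar = 0.0        # running average of (delta - acceptance)
        self.log_eps_bar = 0.0  # averaged iterate, used after warm-up
        self.m = 0
        self.eps = eps0         # step size to use for the next transition

    def update(self, accept_stat):
        """Feed the acceptance statistic of the last transition."""
        self.m += 1
        m = self.m
        w = 1.0 / (m + self.t0)
        self.h_bar = (1.0 - w) * self.h_bar + w * (self.delta - accept_stat)
        log_eps = self.mu - math.sqrt(m) / self.gamma * self.h_bar
        eta = m ** (-self.kappa)
        self.log_eps_bar = eta * log_eps + (1.0 - eta) * self.log_eps_bar
        self.eps = math.exp(log_eps)
        return self.eps

    @property
    def eps_bar(self):
        return math.exp(self.log_eps_bar)


# Demo with a synthetic, deterministic acceptance curve alpha(eps) = exp(-eps).
# This curve is a stand-in for a real sampler, chosen only so the fixed point
# eps* = -log(delta) is known in closed form.
adapter = DualAveraging(eps0=1.0, delta=0.65)
for _ in range(2000):
    adapter.update(math.exp(-adapter.eps))
print(adapter.eps_bar, -math.log(0.65))
```

In a real sampler, `accept_stat` would be the average Metropolis acceptance probability over the states visited in the last transition, and `eps_bar` would be frozen once burn-in ends.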

Empirically, NUTS was evaluated on a 250-dimensional correlated multivariate normal, logistic regression, hierarchical logistic regression, and stochastic volatility. For logistic regression tasks, NUTS matches HMC in ESS per gradient. For the multivariate normal and stochastic volatility models, NUTS with $\delta$ near $0.6$ outperforms the best HMC setting by about a factor of 3 in ESS per gradient. The optimal HMC simulation length $\epsilon L$ varies by about two orders of magnitude across these problems, which is precisely the dependency NUTS is designed to remove (Hoffman et al., 2011).

3. Geometry, convergence theory, and adaptive variants

A geometric interpretation of the No-U-Turn criterion was developed for Euclidean and Riemannian HMC. In Euclidean HMC with constant mass matrix $M$, positions evolve as $\dot\theta = M^{-1} r$, so over a trajectory of total integration time $T$ from $(\theta^-,r^-)$ to $(\theta^+,r^+)$ the displacement satisfies

$\theta^+ - \theta^- = \int_0^T M^{-1} r(t)\,dt = M^{-1}\rho,$

where $\rho = \int_0^T r(t)\,dt$ is the momentum integrated along the trajectory, and the U-turn criterion becomes

$r^{\pm}\cdot M^{-1}\rho < 0 \quad \text{at either endpoint}.$

For approximately harmonic low-energy trajectories, this criterion detects the turning point of the slowest oscillation, the TPOLO. In Riemannian Manifold HMC, naive coordinate differences cease to be geometrically meaningful, so the criterion is generalized by replacing the Euclidean inner product with one defined by the position-dependent metric, and by building the trajectory-integrated momentum from transported momenta using the canonical one-form on the cotangent bundle (Betancourt, 2013).

The general convergence theory of dynamic HMC places NUTS inside a larger class of algorithms defined by an orbit selection kernel and an index selection kernel. Within this framework, NUTS is shown to leave the target distribution invariant as a by-product, and conditions are established under which NUTS is irreducible and aperiodic and, as a corollary, ergodic. Under tail and regularity assumptions similar to those used for HMC, the paper also shows that NUTS is geometrically ergodic, while fixed-length HMC is ergodic without any boundedness condition on the step size and the number of leapfrog steps for targets that are perturbations of Gaussian distributions (Durmus et al., 2023).

A more quantitative high-dimensional result is available for the canonical Gaussian measure. When initialized in the concentration region, the mixing time of NUTS scales as $d^{1/4}$ in the dimension $d$, up to logarithmic factors, and this scaling is argued to be sharp by analogy with the HMC literature. The analysis proceeds by showing that, in the Gaussian concentration shell, NUTS behaves with high probability like an accept/reject chain whose accept kernel is a simpler “Uniform HMC” kernel. The same paper identifies a looping pathology: for certain unfavorable alignments of the time grid, the U-turn criterion becomes ambiguous, orbit selection may fail to stop before the maximum depth, and the chain can spend nearly every transition at maximal tree size. Randomizing the time grid is proposed as a mitigation strategy (Bou-Rabee et al., 2024).

Several post-NUTS variants address aspects that the original algorithm leaves fixed. A GIST-based method incorporates local step-size adaptivity into NUTS by treating the step size as an auxiliary variable and using an acceptance probability that depends exclusively on the conditional distribution of the step size; the method is validated on Neal’s funnel density and a high-dimensional normal distribution (Bou-Rabee et al., 2024). WALNUTS generalizes NUTS further by adapting the leapfrog step size at fixed intervals of simulated time as the orbit evolves, selecting the largest step size from a dyadic schedule that keeps the energy error below a user-specified threshold, and preserving reversibility through an involution on an extended state space (Bou-Rabee et al., 23 Jun 2025). SpreadNUTS proposes a different modification: it replaces strict binary-tree doubling with an expansion schedule in which the number of added points depends on the expansion index, uses a correspondingly non-binary tree for U-turn checking, and biases selection toward less explored regions through nearest-neighbor distances in a spatial index (Sheriff, 2023).

4. Forward-backward diffusion U-turn Markov chains

In a diffusion-model setting, a single U-turn move of size $t$ is defined, in generic notation, by a forward corruption step

$x_t \sim q_t(x_t \mid x),$

followed by a reverse denoising step

$x' \sim p_{0\mid t}(x' \mid x_t),$

which induces the U-turn transition kernel (with the integral read as a sum for discrete data)

$K_t(x' \mid x) = \int q_t(x_t \mid x)\, p_{0\mid t}(x' \mid x_t)\, dx_t.$

Iterating these moves yields a U-turn Markov chain, or UTMC,

$x^{(0)} \to x^{(1)} \to x^{(2)} \to \cdots, \qquad x^{(n+1)} \sim K_t(\cdot \mid x^{(n)}).$

For an exact diffusion model, the kernel satisfies detailed balance with respect to the learned data distribution $p_0$: $p_0(x)\,K_t(x' \mid x) = p_0(x')\,K_t(x \mid x')$. The same kernel can be used as a Metropolis–Hastings proposal for energy-modified targets

$\pi(x) \propto p_0(x)\, e^{-E(x)},$

in which case the acceptance probability simplifies to

$A(x \to x') = \min\{1,\ e^{-[E(x') - E(x)]}\}.$

The diffusion model therefore supplies manifold-respecting proposals, while the energy correction depends only on energy differences (Kang et al., 26 May 2026).
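A toy case makes the energy-only acceptance rule concrete. Assume, purely for illustration, that the data distribution is a one-dimensional standard normal and that the forward step is a variance-preserving Gaussian corruption with retention factor `a`; the exact reverse step is then also Gaussian, so the U-turn proposal is reversible with respect to the data distribution and the Metropolis–Hastings ratio reduces to an exponential of the energy difference. The function names and the quadratic energy are hypothetical.

```python
import math
import random


def u_turn_proposal(x, a, rng):
    """Forward-corrupt x, then denoise with the exact reverse kernel.

    Assumes data ~ N(0, 1) and forward step x_t = a * x + sqrt(1 - a^2) * noise,
    for which the exact posterior is x0 | x_t ~ N(a * x_t, 1 - a^2).
    """
    s = math.sqrt(1.0 - a * a)
    x_t = a * x + s * rng.gauss(0.0, 1.0)     # forward corruption
    return a * x_t + s * rng.gauss(0.0, 1.0)  # exact reverse denoising


def u_turn_mh(energy, x0, a, n_steps, rng):
    """Metropolis-Hastings chain targeting p0(x) * exp(-energy(x))."""
    x, chain = x0, []
    for _ in range(n_steps):
        y = u_turn_proposal(x, a, rng)
        # The proposal is reversible w.r.t. p0, so only the energy
        # difference enters the acceptance probability.
        if rng.random() < math.exp(min(0.0, -(energy(y) - energy(x)))):
            x = y
        chain.append(x)
    return chain


# With E(x) = x^2 / 2 the target is N(0, 1/2), so the variance should be near 0.5.
rng = random.Random(0)
chain = u_turn_mh(lambda x: 0.5 * x * x, 0.0, a=0.6, n_steps=20000, rng=rng)
mean = sum(chain) / len(chain)
print(sum((x - mean) ** 2 for x in chain) / len(chain))
```

A smaller `a` corresponds to a larger U-turn: the proposal forgets more of the current state, which lengthens moves but typically lowers the acceptance rate when the energy is sharply peaked.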

The central dynamical issue is ergodicity. In the Random Hierarchy Model, minimal U-turns are obtained by masking exactly one leaf token at each step, so the U-turn chain becomes a random walk on the graph whose vertices are valid sentences and whose edges connect sentences differing at exactly one leaf. The paper identifies an ergodicity-breaking phase transition controlled by the rule density of the grammar, approximates the percolation threshold through a branching-process estimate, and derives the asymptotic behavior of that threshold.

For small rule density, the connectivity graph fragments and the chain is non-ergodic; for large rule density, a giant connected component exists and the chain mixes over the full data space. Increasing the U-turn magnitude introduces longer-range moves and restores effective ergodicity even below the percolation threshold.
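The graph picture can be illustrated on a toy set of valid strings. The sketch below builds the graph whose edges join valid sentences differing at exactly one position and counts its connected components by breadth-first search; more than one component means a single-site chain cannot reach every sentence. The sentence sets are made up for illustration and are not generated by a Random Hierarchy Model.

```python
from collections import deque


def one_site_neighbors(sentence, valid, alphabet):
    """Yield valid sentences that differ from `sentence` at exactly one position."""
    for i, current in enumerate(sentence):
        for symbol in alphabet:
            if symbol != current:
                candidate = sentence[:i] + symbol + sentence[i + 1:]
                if candidate in valid:
                    yield candidate


def connected_components(sentences):
    """Return the connected components of the single-site move graph."""
    valid = set(sentences)
    alphabet = sorted({symbol for s in valid for symbol in s})
    seen, components = set(), []
    for start in sorted(valid):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            s = queue.popleft()
            for t in one_site_neighbors(s, valid, alphabet):
                if t not in seen:
                    seen.add(t)
                    component.add(t)
                    queue.append(t)
        components.append(component)
    return components


sparse = ["aa", "bb"]        # no single-site path between the two sentences
dense = ["aa", "ab", "bb"]   # "ab" bridges the two
print(len(connected_components(sparse)))  # 2
print(len(connected_components(dense)))   # 1
```

Allowing moves that change several positions at once, the analogue of a larger U-turn, adds edges to this graph and can merge components that single-site moves leave disconnected.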

The same work studies hierarchical relaxation. In the non-ergodic or weakly mixing regime, low-level features relax faster than high-level ones. In the Random Hierarchy Model this appears as an ordering of relaxation times by hierarchy level, for small U-turn magnitude and sparse grammars. At sufficiently large U-turn magnitude, the ordering inverts, and the paper gives an estimate of the inversion threshold.

Analogous depth-ordered relaxation, and inversion only at large noise when mixing is efficient, is reported for natural language using a masked diffusion LLM probed by Mistral 7B residual-stream activations, and for natural images using guided-diffusion probed by ConvNeXt-Base features. The reported pattern is that minimal U-turns are local, strongly constrained, and weakly mixing, while large U-turns restore mixing at the cost of coarser moves (Kang et al., 26 May 2026).

5. Truncated diffusion loops and U-turn times

A distinct diffusion usage appears in U-Turn diffusion. Here the forward process starts from a ground-truth sample at artificial time $t=0$, runs only to a finite U-turn time $T_u$, and the reverse process is initialized at that forward endpoint and run back to $t=0$. In the variance-preserving SDE,

$dx_t = -\tfrac12\,\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dW_t,$

the conditional forward kernel is

$p(x_t \mid x_0) = \mathcal{N}\Big(x_t;\ x_0\,e^{-\frac12\int_0^t \beta(s)\,ds},\ \big(1 - e^{-\int_0^t \beta(s)\,ds}\big) I\Big),$

and the reverse dynamics use the score $\nabla_x \log p_t(x)$. The U-Turn reverse process is initialized at $t=T_u$ with a sample from the probability distribution of the forward process, ensuring a detailed balance relation between the shortened forward and reverse processes (Behjoo et al., 2023).

This construction defines a forward-then-reverse chain anchored at a specific ground-truth sample. The paper characterizes the forward process through its auto-correlation with the initial sample and defines analogous diagnostics for the reverse process. The score is learned by denoising score matching, and the paper also tracks normalized score norms.

The abstract reports a critical Memorization Time $T_m$, beyond which generated samples diverge from the ground-truth sample used to initialize the U-turn scheme, and a Speciation Time $T_s$, where for $T_u > T_s$, samples begin representing different classes. The paper's technical discussion operationalizes the same transition through auto-correlation decay, score norm saturation, Kolmogorov–Smirnov Gaussianity tests, and KID. In experiments on butterfly images, the KID minimum occurs at a schedule-dependent U-turn time, with one value shared by the linear and sigmoid schedules and another for the cosine schedule. The paper also reports that U-turn initialization yields much lower KID than random-noise initialization at the same time. A central conclusion is that the score becomes approximately, and eventually effectively, affine at large times, so very large U-turn times move the chain into a regime dominated by nearly Gaussian dynamics rather than sample-specific structure (Behjoo et al., 2023).
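The loss of memory with increasing U-turn time can be reproduced in a toy model where the score is known exactly. The sketch below assumes a one-dimensional data set of two points, $+1$ and $-1$, a variance-preserving SDE with constant $\beta$, and an Euler–Maruyama discretization of the reverse dynamics; all of these are illustrative assumptions, and in this two-point toy the notions of "same sample" and "same class" coincide.

```python
import math
import random

BETA = 1.0  # constant noise schedule (assumption)


def mean_scale(t):
    return math.exp(-0.5 * BETA * t)


def variance(t):
    return 1.0 - math.exp(-BETA * t)


def score(x, t):
    """Exact score of the noised two-point data set {+1, -1} at time t > 0."""
    m, s2 = mean_scale(t), variance(t)
    return (-x + m * math.tanh(m * x / s2)) / s2


def u_turn(x0, t_u, rng, dt=0.01):
    """Run the forward process from x0 to time t_u, then reverse back to 0."""
    # Forward: sample directly from the Gaussian transition kernel.
    x = mean_scale(t_u) * x0 + math.sqrt(variance(t_u)) * rng.gauss(0.0, 1.0)
    n = max(1, round(t_u / dt))
    h = t_u / n
    # Reverse: Euler-Maruyama steps from t_u down to 0.
    for k in range(n, 0, -1):
        t = k * h
        drift = 0.5 * BETA * x + BETA * score(x, t)
        x = x + drift * h + math.sqrt(BETA * h) * rng.gauss(0.0, 1.0)
    return x


def same_class_fraction(t_u, trials, rng):
    """Fraction of U-turns started at +1 that end on the positive side."""
    return sum(u_turn(1.0, t_u, rng) > 0 for _ in range(trials)) / trials


rng = random.Random(0)
print(same_class_fraction(0.1, 200, rng))  # short U-turn: stays near the start
print(same_class_fraction(8.0, 200, rng))  # long U-turn: memory is lost
```

For short U-turn times almost every trajectory returns to its starting point, while for long ones the endpoint is close to an even coin flip between the two points, mirroring the transition the paper describes with its own diagnostics.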

6. U-turn boundaries in colored stochastic vertex models

In integrable probability, a different notion of U-turn chain arises in colored stochastic vertex models with U-turn right boundary. The model is defined on a rectangular lattice whose horizontal edges carry signed colors, while vertical edges carry vectors of color multiplicities. Odd rows and even rows use two different types of bulk vertex, and each pair of rows is coupled on the right boundary by a cap vertex with its own spectral parameter. A path entering the cap can reflect without sign change or reflect with a sign flip, each with a prescribed cap weight.

Geometrically, paths move right across one row of each pair, hit the U-turn cap, then move left across the other row (Zhong, 2024).

The local weights are stochastic: for fixed input, the sum over admissible outputs is $1$. This allows the model to be interpreted as a discrete-time inhomogeneous Markov chain on vertical states, with the cap acting as a reflecting boundary rule on colors. The bulk R-matrices satisfy Yang–Baxter equations, and the cap satisfies a reflection equation, so the corresponding double-row transfer matrices form an integrable open-boundary system. The partition functions satisfy recursive relations derived from these equations, and after renormalization the resulting functions realize the Noumi representation of the affine Hecke algebra of type $C$. In this literature, “U-turn chains” refers not to MCMC but to reflecting colored path ensembles whose boundary reflection is encoded by stochastic cap weights and controlled algebraically by Yang–Baxter and reflection relations (Zhong, 2024).