Relative Trajectory Balance (RTB)

Updated 4 July 2026

Relative Trajectory Balance (RTB) is an off-policy objective that matches a learned trajectory distribution to a reward- or likelihood-tilted reference distribution.
It uses a squared log-ratio loss that enforces a pathwise balance condition, so that at zero loss the learned terminal marginal matches the reward-weighted prior marginal.
RTB supports stable training through replay buffers and loss clipping, and it is equivalent, up to a positive constant factor, to the KL-regularized method Trust-PCL under a specific reward design.

Relative Trajectory Balance (RTB) is a trajectory-level, off-policy squared-residual objective for sequential generative models in which a learned trajectory distribution is trained relative to a reference or prior trajectory distribution, with terminal weighting provided by a reward, likelihood, or energy term. In the diffusion-posterior formulation, RTB enforces the pathwise relation $Z_\phi\,p_\phi(\tau)\approx r(x_1)\,p_\theta(\tau)$, so that marginalizing trajectories yields $p_\phi(x_1)\propto p_\theta(x_1)\,r(x_1)$; in the sequential fine-tuning formulation, it targets a terminal marginal of the form $P_\phi^\top(s_T)\propto \pi_{\mathrm{prior}}^\top(s_T)\exp(-E(s_T)/\alpha)$ (Scimeca et al., 12 Mar 2025; Deleu et al., 1 Sep 2025). A later theoretical analysis places RTB inside KL-regularized reinforcement learning by proving an exact equivalence to Trust-PCL up to a positive constant factor, reframing RTB as a path-consistency objective rather than a fundamentally separate paradigm (Deleu et al., 1 Sep 2025).

1. Core formulation

In the diffusion-based presentation, RTB is defined on a denoising trajectory

$\tau = (x_0 \rightarrow x_{\Delta t}\rightarrow \cdots \rightarrow x_1), \qquad \Delta t=\frac{1}{T},$

with Markov transitions

$p(x_{t+\Delta t}\mid x_t)=\mathcal N\!\left(x_{t+\Delta t}\mid x_t+u_t(x_t)\Delta t,\sigma_t^2\Delta t\,I\right),$

and trajectory density

$p(x_0,x_{\Delta t},\dots,x_1)=p(x_0)\prod_{i=1}^T p(x_{i\Delta t}\mid x_{(i-1)\Delta t}).$
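
As a minimal sketch of the transition and trajectory densities above, the following pure-Python code rolls out one Euler-Maruyama trajectory and evaluates its log-density. The scalar state, the constant noise scale `sigma`, the standard-normal law for $x_0$, and the drift `toy_drift` are illustrative assumptions, not choices taken from the cited papers.

```python
import math
import random


def step_logpdf(x_next, x, drift, sigma, dt):
    """log N(x_next | x + drift * dt, sigma^2 * dt) for a scalar state."""
    mean = x + drift * dt
    var = sigma ** 2 * dt
    return -0.5 * math.log(2.0 * math.pi * var) - (x_next - mean) ** 2 / (2.0 * var)


def sample_trajectory(drift_fn, sigma, T, rng):
    """Roll out x_0 -> x_dt -> ... -> x_1 with dt = 1/T and x_0 ~ N(0, 1) (assumed)."""
    dt = 1.0 / T
    xs = [rng.gauss(0.0, 1.0)]
    for i in range(T):
        x = xs[-1]
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        xs.append(x + drift_fn(x, i * dt) * dt + noise)
    return xs


def trajectory_logprob(xs, drift_fn, sigma):
    """log p(x_0) plus the sum of the T Gaussian transition log-densities."""
    T = len(xs) - 1
    dt = 1.0 / T
    logp = -0.5 * math.log(2.0 * math.pi) - 0.5 * xs[0] ** 2  # log N(x_0 | 0, 1)
    for i in range(T):
        logp += step_logpdf(xs[i + 1], xs[i], drift_fn(xs[i], i * dt), sigma, dt)
    return logp


def toy_drift(x, t):
    """Hypothetical drift that pulls the state toward 2.0."""
    return 2.0 - x


rng = random.Random(0)
xs = sample_trajectory(toy_drift, sigma=1.0, T=10, rng=rng)
logp = trajectory_logprob(xs, toy_drift, sigma=1.0)
```

Evaluating `trajectory_logprob` with two different drift functions on the same `xs` gives the two log-densities whose difference enters an RTB-style log-ratio.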

Given a pretrained prior diffusion model $p_\theta$, a learned posterior model $p_\phi$, a positive terminal reward $r(x_1)$, and a learnable scalar normalizer $Z_\phi$, the RTB loss is

$\mathcal{L}_{\mathrm{RTB}}(\tau;\phi)=\left(\log\frac{Z_\phi\,p_\phi(\tau)}{r(x_1)\,p_\theta(\tau)}\right)^2.$

The zero-loss condition implies

$Z_\phi\,p_\phi(\tau)=r(x_1)\,p_\theta(\tau)\ \text{for all }\tau\quad\Longrightarrow\quad p_\phi(x_1)\propto p_\theta(x_1)\,r(x_1),$

so the objective is designed to recover a reward-tilted posterior rather than merely a high-reward policy (Scimeca et al., 12 Mar 2025).
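
A minimal sketch of the balance condition on a toy discrete problem, assuming the loss is the squared log-ratio between $Z_\phi\,p_\phi(\tau)$ and $r(x_1)\,p_\theta(\tau)$, in line with the pathwise relation given in the introduction. The four-trajectory prior, the reward values, and all function names are hypothetical.

```python
import math

# Toy prior over four trajectories; each one ends in terminal state "a" or "b".
prior = {("s0", "u", "a"): 0.1, ("s0", "v", "a"): 0.3,
         ("s0", "u", "b"): 0.2, ("s0", "v", "b"): 0.4}
reward = {"a": 3.0, "b": 1.0}  # positive terminal reward r(x_1)

# Exact reward-tilted posterior over trajectories and its normalizer Z.
Z = sum(p * reward[tau[-1]] for tau, p in prior.items())
posterior = {tau: p * reward[tau[-1]] / Z for tau, p in prior.items()}


def rtb_loss(log_Z, logp_post, logp_prior, log_r):
    """Squared log-ratio between Z * p_post(tau) and r(x_1) * p_prior(tau)."""
    return (log_Z + logp_post - logp_prior - log_r) ** 2


losses = [rtb_loss(math.log(Z), math.log(posterior[tau]), math.log(prior[tau]),
                   math.log(reward[tau[-1]])) for tau in prior]


def terminal_marginal(dist, x):
    """Sum trajectory probabilities over all trajectories ending in x."""
    return sum(p for tau, p in dist.items() if tau[-1] == x)


# p_post(x_1) / (p_prior(x_1) * r(x_1)) should be the same constant for every x_1.
ratios = [terminal_marginal(posterior, x) / (terminal_marginal(prior, x) * reward[x])
          for x in reward]
```

Every entry of `losses` vanishes up to rounding, and the entries of `ratios` agree, which is the proportionality of terminal marginals in this toy case.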

The same paper gives a conditional, amortized version for inverse problems:

$\mathcal{L}_{\mathrm{RTB}}(\tau,y;\phi)=\left(\log\frac{Z_\phi(y)\,p_\phi(\tau\mid y)}{r(x_1,y)\,p_\theta(\tau)}\right)^2.$

Here the learned sampler is conditioned on an observation $y$, and the scalar normalizer becomes a function $Z_\phi(y)$. This turns RTB into an amortized posterior-training objective: a single model can be trained to generalize across measurements rather than solve each posterior at sampling time from scratch (Scimeca et al., 12 Mar 2025).

In the finite-horizon sequential fine-tuning presentation, RTB is written as an expectation over an arbitrary behavior distribution $\pi_b$:

$\mathcal{L}_{\mathrm{RTB}}(\phi)=\frac{1}{2}\,\mathbb{E}_{\tau\sim\pi_b}\!\left[\Delta_{\mathrm{RTB}}(\tau;\phi)^2\right],$

with residual

$\Delta_{\mathrm{RTB}}(\tau;\phi)=\log\frac{Z_\phi\prod_{t=0}^{T-1}P_\phi(s_{t+1}\mid s_t)}{\exp(-E(s_T)/\alpha)\prod_{t=0}^{T-1}\pi_{\mathrm{prior}}(s_{t+1}\mid s_t)}.$

This formulation emphasizes the same structure in a different notation: a learned trajectory law $P_\phi$ is matched to a reference trajectory law $\pi_{\mathrm{prior}}$ tilted by terminal energy $E(s_T)$ and normalized by $Z_\phi$ (Deleu et al., 1 Sep 2025).

2. From trajectory balance to relative trajectory balance

RTB inherits its basic pathwise logic from Trajectory Balance (TB), introduced for GFlowNets as a trajectory-level alternative to flow matching and detailed balance. In the original TB formulation, for a complete trajectory $\tau=(s_0\rightarrow s_1\rightarrow\cdots\rightarrow s_n=x)$ with forward policy $P_F$, backward policy $P_B$, reward $R$, and learned partition function $Z$,

$\mathcal{L}_{\mathrm{TB}}(\tau)=\left(\log\frac{Z\prod_{t=1}^{n}P_F(s_t\mid s_{t-1})}{R(x)\prod_{t=1}^{n}P_B(s_{t-1}\mid s_t)}\right)^2,$

and any global minimizer defines a policy that samples exactly from the reward-proportional target distribution (Malkin et al., 2022). TB is therefore an absolute balance condition: forward trajectory mass, multiplied by a learned partition function, is matched directly to terminal reward times backward path mass.
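
A minimal sketch of the TB residual on a toy tree-structured state space, where every state has a single parent so the backward policy is deterministic (log-probability 0). The state names, rewards, and policy values are hypothetical and chosen so that the balance condition holds exactly.

```python
import math


def tb_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
    """Squared TB residual: log(Z * prod P_F) - log(R(x) * prod P_B)."""
    return (log_Z + sum(log_pf_steps) - log_reward - sum(log_pb_steps)) ** 2


rewards = {"x_a": 1.0, "x_b": 3.0}
Z = sum(rewards.values())                         # true partition function
forward = {x: r / Z for x, r in rewards.items()}  # reward-proportional forward policy

balanced = [tb_loss(math.log(Z), [math.log(forward[x])], [0.0], math.log(rewards[x]))
            for x in rewards]

# A uniform forward policy violates the balance condition for these rewards.
unbalanced = tb_loss(math.log(Z), [math.log(0.5)], [0.0], math.log(rewards["x_a"]))
```

The reward-proportional policy drives every residual to zero, while the uniform policy leaves a residual of $\log 2$ on the low-reward trajectory.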

RTB is a relative version of this construction. In the diffusion derivation, the target unnormalized density is $p_\theta(x_1)\,r(x_1)$, and the appendix shows that RTB arises from a TB-style objective in which posterior and prior share the same fixed backward process, causing the backward terms to cancel. The resulting balance equation is

$Z_\phi\,p_\phi(x_0,x_{\Delta t},\dots,x_1)=r(x_1)\,p_\theta(x_0,x_{\Delta t},\dots,x_1),$

so the reference process appears explicitly inside the numerator-denominator ratio rather than only through a terminal reward (Scimeca et al., 12 Mar 2025).
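
The cancellation can be sketched in three lines. The symbol $q(\tau\mid x_1)$ for the shared backward (noising) process is notation assumed here for illustration; it is taken to be the backward process of the prior itself, so that the prior factorizes as $p_\theta(\tau)=p_\theta(x_1)\,q(\tau\mid x_1)$.

```latex
\begin{align*}
% TB for the unnormalized target p_theta(x_1) r(x_1), with backward process q:
Z_\phi\, p_\phi(\tau) &= p_\theta(x_1)\, r(x_1)\, q(\tau \mid x_1) \\
% The prior factorizes through the same backward process:
p_\theta(\tau) &= p_\theta(x_1)\, q(\tau \mid x_1) \\
% Substituting removes q and the intractable marginal p_theta(x_1):
Z_\phi\, p_\phi(\tau) &= r(x_1)\, p_\theta(\tau)
\end{align*}
```

The final line involves only quantities that can be evaluated along a trajectory: the two forward trajectory densities, the terminal reward, and the learned normalizer.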

A later theoretical paper sharpens this relation. It studies RTB as a method for fine-tuning sequential generative models toward

$P_\phi^\top(s_T)\propto \pi_{\mathrm{prior}}^\top(s_T)\exp(-E(s_T)/\alpha)$

and proves that RTB is exactly equivalent to Trust-PCL when the reward design satisfies

$r(s_t,s_{t+1})=\begin{cases}-E(s_T) & \text{if } t=T-1,\\ 0 & \text{otherwise,}\end{cases}$

with the correspondences

$\pi_\phi(s_{t+1}\mid s_t)=P_\phi(s_{t+1}\mid s_t),\qquad V(s_0)=\alpha\log Z_\phi.$

Under these identifications,

$\mathcal{L}_{\mathrm{RTB}}(\phi)=\frac{1}{\alpha^2}\,\mathcal{L}_{\text{Trust-PCL}}(\phi).$

This is an exact objective equivalence up to the multiplicative factor $1/\alpha^2$, not merely an analogy at the level of fixed points or asymptotic optima (Deleu et al., 1 Sep 2025).
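
The proportionality between the two objectives can be checked numerically on a single trajectory. The sketch below assumes a Trust-PCL-style path-consistency residual of the form $-V(s_0)+\sum_t r_t-\alpha\sum_t\log\frac{\pi(s_{t+1}\mid s_t)}{\pi_{\mathrm{prior}}(s_{t+1}\mid s_t)}$ with no entropy term and no discounting, a sparse reward equal to $-E(s_T)$ on the final transition, and the identification $V(s_0)=\alpha\log Z_\phi$. These forms, the function names, and the numeric values are assumptions made for illustration rather than the cited paper's exact notation.

```python
import math


def rtb_residual(log_Z, logp_model, logp_prior, energy, alpha):
    """log Z + sum log p_model - sum log p_prior + E(s_T) / alpha."""
    return log_Z + sum(logp_model) - sum(logp_prior) + energy / alpha


def trust_pcl_residual(v0, rewards, logp_model, logp_prior, alpha):
    """-V(s_0) + sum_t [ r_t - alpha * (log p_model_t - log p_prior_t) ]."""
    path = sum(r - alpha * (lm - lp)
               for r, lm, lp in zip(rewards, logp_model, logp_prior))
    return -v0 + path


alpha, log_Z, energy = 0.5, 0.7, 1.3
logp_model = [math.log(0.6), math.log(0.2), math.log(0.9)]
logp_prior = [math.log(0.5), math.log(0.4), math.log(0.7)]
rewards = [0.0, 0.0, -energy]  # sparse terminal reward -E(s_T)

d_rtb = rtb_residual(log_Z, logp_model, logp_prior, energy, alpha)
d_pcl = trust_pcl_residual(alpha * log_Z, rewards, logp_model, logp_prior, alpha)
ratio = d_pcl ** 2 / d_rtb ** 2  # equals alpha ** 2 up to rounding
```

Under these assumptions the path-consistency residual is exactly `-alpha * d_rtb`, so the squared residuals differ by `alpha ** 2` on every trajectory and therefore under any behavior distribution.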

The same paper also shows an on-policy gradient relation. If one differentiates RTB while ignoring derivatives through the sampling distribution, then the resulting gradient equals a REINFORCE-with-KL gradient up to a constant factor that depends only on $\alpha$, provided the REINFORCE baseline is chosen as

$b=\alpha\log Z_\phi.$

This places RTB even more directly inside the family of KL-regularized path-consistency methods rather than outside it (Deleu et al., 1 Sep 2025).
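
A short derivation sketch of the gradient relation, under conventions assumed here rather than taken from the paper: the per-trajectory RTB loss is $\tfrac12\Delta(\tau)^2$ with $\Delta(\tau)=\log Z_\phi+\log P_\phi(\tau)-\log\pi_{\mathrm{prior}}(\tau)+E(s_T)/\alpha$, gradients are taken with respect to the policy parameters only, trajectories are sampled on-policy, and the KL-regularized objective is $J(\phi)=\mathbb{E}_{\tau\sim P_\phi}\!\left[-E(s_T)-\alpha\log\frac{P_\phi(\tau)}{\pi_{\mathrm{prior}}(\tau)}\right]$.

```latex
\begin{align*}
\nabla_\phi \tfrac12 \Delta(\tau)^2
  &= \Delta(\tau)\,\nabla_\phi \log P_\phi(\tau)
  && \text{(sampled } \tau \text{ held fixed)} \\
\nabla_\phi J(\phi)
  &= \mathbb{E}_{\tau\sim P_\phi}\!\Big[\Big(-E(s_T)
     -\alpha\log\tfrac{P_\phi(\tau)}{\pi_{\mathrm{prior}}(\tau)}-b\Big)
     \nabla_\phi\log P_\phi(\tau)\Big]
  && \text{(REINFORCE with baseline } b\text{)} \\
-E(s_T)-\alpha\log\tfrac{P_\phi(\tau)}{\pi_{\mathrm{prior}}(\tau)}-\alpha\log Z_\phi
  &= -\alpha\,\Delta(\tau)
  && \text{(choosing } b=\alpha\log Z_\phi\text{)} \\
\nabla_\phi J(\phi)
  &= -\alpha\,\mathbb{E}_{\tau\sim P_\phi}\!\big[\Delta(\tau)\,\nabla_\phi\log P_\phi(\tau)\big]
\end{align*}
```

Under these conventions, a gradient-descent step on the RTB loss and a gradient-ascent step on $J$ point in the same direction and differ only by a positive factor determined by $\alpha$.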

3. Off-policy character and optimization mechanics

A defining property of RTB is that it is explicitly formulated as an off-policy objective. In the sequential formulation, the expectation is taken under an arbitrary behavior distribution $\pi_b$, and in the diffusion formulation the gradient with respect to $\phi$ does not require backpropagation through the trajectory-generation process. This permits training trajectories to come from distributions other than the current model, including replay-buffer policies and backtracking-based behavior policies (Deleu et al., 1 Sep 2025; Scimeca et al., 12 Mar 2025).

The diffusion paper makes this operational with replay and backtracking exploration. It defines a replay buffer of terminal samples,

$\mathcal{B}=\{x_1^{(i)}\}_{i=1}^{N},$

and a behavior distribution that first draws a stored terminal sample and then generates a trajectory backward from it,

$x_1\sim P_{\mathcal{B}},\qquad \tau\sim q(\,\cdot\mid x_1),$

where $P_{\mathcal{B}}$ is a prioritization distribution over stored terminal samples. The intended use is to obtain high-target-density terminal states $x_1$, sample reverse/noising trajectories

$x_1\rightarrow x_{1-\Delta t}\rightarrow\cdots\rightarrow x_0,$

and then train RTB on those trajectories even if the current sampler $p_\phi$ cannot yet reach those modes on-policy. The paper presents this as a direct mechanism for mode discovery and for preventing catastrophic forgetting once good terminal samples have been discovered (Scimeca et al., 12 Mar 2025).
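
The replay-and-backtrack loop can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the priority is taken to be proportional to the exponential of a stored log-score, the noising kernel is a driftless Gaussian walk standing in for whatever backward process the model actually uses, and the class and function names are hypothetical.

```python
import math
import random


class TerminalReplayBuffer:
    """Stores terminal samples x_1 together with a log-target score."""

    def __init__(self):
        self.items = []  # list of (x1, log_score) pairs

    def add(self, x1, log_score):
        self.items.append((x1, log_score))

    def sample_terminal(self, rng):
        """Draw x_1 with probability proportional to exp(log_score) (assumed priority)."""
        top = max(score for _, score in self.items)
        weights = [math.exp(score - top) for _, score in self.items]
        return rng.choices([x for x, _ in self.items], weights=weights, k=1)[0]


def backtrack(x1, sigma, T, rng):
    """Noise x_1 backward for T steps, then return the path in forward order."""
    dt = 1.0 / T
    path = [x1]
    for _ in range(T):
        path.append(path[-1] + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return path[::-1]  # x_0 -> ... -> x_1


rng = random.Random(0)
buffer = TerminalReplayBuffer()
for stored_x1, log_score in [(-2.0, -5.0), (0.5, -1.0), (3.0, 0.0)]:
    buffer.add(stored_x1, log_score)

x1 = buffer.sample_terminal(rng)
trajectory = backtrack(x1, sigma=1.0, T=8, rng=rng)
```

The returned `trajectory` ends at the stored terminal sample, so it can be scored under both the learned and the prior forward processes and passed to an RTB-style loss even when the current sampler would rarely produce it.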

The same source describes two practical stabilizations. First, it replaces the squared loss by the empirical variance over a minibatch of the quantity inside the square, thereby removing dependence on $\log Z_\phi$; this is described as a relative variant of VarGrad. Second, it applies loss clipping by skipping updates on trajectories whose loss is already close to $0$. Both modifications are presented as numerical stabilizers, especially in conditional settings (Scimeca et al., 12 Mar 2025).
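
Both stabilizers can be sketched in a few lines. The batch values, the threshold `eps`, and the choice to average only the retained trajectories are hypothetical; the paper's exact clipping criterion is not reproduced here.

```python
def residuals_without_normalizer(logp_post, logp_prior, log_r):
    """Per-trajectory log p_post - log p_prior - log r, with the normalizer left out."""
    return [a - b - c for a, b, c in zip(logp_post, logp_prior, log_r)]


def vargrad_rtb_loss(logp_post, logp_prior, log_r):
    """Empirical (biased) variance of the residuals over the minibatch."""
    xi = residuals_without_normalizer(logp_post, logp_prior, log_r)
    mean = sum(xi) / len(xi)
    return sum((v - mean) ** 2 for v in xi) / len(xi)


def clipped_mean_loss(per_trajectory_losses, eps):
    """Average only the losses above eps; trajectories at or below eps are skipped."""
    kept = [loss for loss in per_trajectory_losses if loss > eps]
    return sum(kept) / len(kept) if kept else 0.0


logp_post = [-3.0, -2.5, -4.0]
logp_prior = [-2.0, -3.5, -3.0]
log_r = [0.5, -1.0, 1.5]

loss = vargrad_rtb_loss(logp_post, logp_prior, log_r)
# Rescaling the reward by a constant shifts every residual equally,
# so the variance-based loss is unchanged and no normalizer is needed.
shifted = vargrad_rtb_loss(logp_post, logp_prior, [v + 7.0 for v in log_r])
clipped = clipped_mean_loss([0.0, 1e-9, 0.5, 1.5], eps=1e-6)
```

The variance form equals the mean squared residual minimized over a scalar log-normalizer, which is why the normalizer drops out of the objective.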

The equivalence paper uses off-policy capability to revisit an illustrative example from the RTB literature. It argues that earlier negative results for KL-regularized RL were partly caused by a reward mismatch, and reports that off-policy REINFORCE with self-normalized importance sampling (SNIS), when given the correct reward design and the same exploratory behavior policy, can recover the expected tilted target distribution as accurately as RTB on the toy task (Deleu et al., 1 Sep 2025). This does not negate RTB’s off-policy utility; rather, it narrows the dispute to formulation and reward specification.
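
For reference, self-normalized importance sampling reweights trajectories drawn from a behavior policy so that weighted averages approximate expectations under the target policy. The sketch below shows the generic estimator only; how the cited paper combines it with REINFORCE is not reproduced, and the function names and numbers are hypothetical.

```python
import math


def snis_weights(logp_target, logp_behavior):
    """Weights proportional to p_target(tau_i) / p_behavior(tau_i), normalized to sum to 1."""
    log_w = [a - b for a, b in zip(logp_target, logp_behavior)]
    top = max(log_w)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def snis_estimate(values, logp_target, logp_behavior):
    """Self-normalized estimate of the target-policy expectation of `values`."""
    weights = snis_weights(logp_target, logp_behavior)
    return sum(w * v for w, v in zip(weights, values))


logp_target = [math.log(0.5), math.log(0.25), math.log(0.25)]
logp_behavior = [math.log(1.0 / 3.0)] * 3  # uniform behavior policy
weights = snis_weights(logp_target, logp_behavior)
estimate = snis_estimate([1.0, 2.0, 4.0], logp_target, logp_behavior)
```

Because the weights are normalized within the batch, any constant factor shared by all target log-probabilities cancels, at the cost of a small bias that shrinks as the batch grows.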

4. RTB for Bayesian inverse problems with diffusion priors

The first major applied use of RTB in this literature is posterior training for Bayesian inverse problems with pretrained diffusion priors. The inferential target is

$p(x\mid y)\propto p(x)\,p(y\mid x),$

with $p(x)$ represented implicitly by a diffusion model rather than a tractable closed-form density. RTB addresses the intractability of the prior marginal $p_\theta(x_1)$ by working on path space: it compares entire posterior trajectories to prior trajectories and uses the likelihood $p(y\mid x_1)$ as terminal reward $r(x_1)$ (Scimeca et al., 12 Mar 2025).
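
To make the role of the likelihood as reward concrete, the following sketch uses a one-dimensional conjugate example that is purely illustrative: a standard-normal prior stands in for the diffusion prior, and the observation model is assumed to be $y=x+\text{Gaussian noise}$. It normalizes prior times reward on a grid and compares the resulting mean with the closed-form posterior mean.

```python
import math


def log_prior(x):
    """log N(x | 0, 1), a stand-in for the prior density."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def log_reward(x, y, noise_std):
    """Gaussian log-likelihood log p(y | x), used as the terminal log-reward."""
    return (-0.5 * ((y - x) / noise_std) ** 2
            - math.log(noise_std) - 0.5 * math.log(2.0 * math.pi))


y, noise_std = 1.5, 0.5
grid = [-8.0 + 0.001 * i for i in range(16001)]
weights = [math.exp(log_prior(x) + log_reward(x, y, noise_std)) for x in grid]
Z = sum(weights)  # unnormalized mass; the grid spacing cancels in the ratio below
posterior_mean = sum(x * w for x, w in zip(grid, weights)) / Z

# Conjugate closed form for comparison: mean = y / (1 + noise_std^2).
closed_form_mean = y / (1.0 + noise_std ** 2)
```

In the diffusion setting the prior density is not available in closed form, which is why the comparison is carried out on trajectories rather than on a grid over $x$.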

The paper instantiates this framework on several concrete inverse problems: inpainting, Fourier phase retrieval, nonlinear deblur, and gravitational lensing, each with its own measurement model and corresponding likelihood.

The conditional RTB objective then trains an amortized posterior sampler conditioned on the observation $y$ rather than a training-free guided sampler specialized to a single observation (Scimeca et al., 12 Mar 2025).

A practical modification introduces a Langevin-biased posterior drift, in which a term driven by the gradient of the log-reward at the current state is added to the learned drift. This is described as an inductive bias that allows reward information to influence intermediate diffusion states rather than only the terminal state. The paper also notes that one may use parameter-efficient fine-tuning, that full backpropagation through the sampling process is not required, and that one may backpropagate only through a subset of trajectory summands when full trajectory memory is too expensive (Scimeca et al., 12 Mar 2025).
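
One way to read a Langevin-style bias is as an extra drift term proportional to the gradient of the log-reward at the current state. The sketch below assumes that form; the exact parameterization used in the paper is not reproduced here, and `base_drift`, `gain`, and the Gaussian log-reward are hypothetical stand-ins for what would in practice be learned networks and a task-specific likelihood.

```python
def langevin_biased_drift(x, t, base_drift, gain, grad_log_reward):
    """Base drift plus a time-dependent multiple of the log-reward gradient."""
    return base_drift(x, t) + gain(t) * grad_log_reward(x)


def make_gaussian_grad_log_reward(y, noise_std):
    """Gradient in x of log N(y | x, noise_std^2), i.e. (y - x) / noise_std^2."""
    def grad(x):
        return (y - x) / noise_std ** 2
    return grad


def base_drift(x, t):
    """Hypothetical unbiased drift pulling the state toward 0."""
    return -x


def gain(t):
    """Hypothetical schedule: the reward gradient matters more near t = 1."""
    return t


grad_log_r = make_gaussian_grad_log_reward(y=1.5, noise_std=0.5)
drift_value = langevin_biased_drift(0.0, 0.5, base_drift, gain, grad_log_r)
```

With this form the reward pulls intermediate states toward high-likelihood regions at every step, instead of acting only through the terminal term of the loss.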

Empirically, the paper compares RTB with training-free posterior methods such as DPS, FPS, FPS-SMC, classifier-guidance-style methods, CLA in gravitational lensing, and hybrids such as RTB+DPS and RTB+DPS+CLA. Its main reported pattern is that guidance-based methods often achieve stronger likelihood or reconstruction metrics, whereas RTB tends to do much better on the metric that the authors interpret as evidence of greater posterior faithfulness. The reported metrics include LPIPS and FID, among others. In gravitational lensing, the paper contrasts the number of diffusion steps used by RTB with the number CLA required for reasonable samples; at the same time, it notes that training can be unstable for very peaky rewards, and that a fraction of lensing runs diverged (Scimeca et al., 12 Mar 2025).

The paper’s abstract states that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases. The body text does not provide a formal latent-variable derivation of this bias, but it does present the broader empirical claim that heuristic posterior-guidance dynamics can shift samples away from the true posterior, often yielding artificially high likelihood values at the cost of reduced diversity (Scimeca et al., 12 Mar 2025).

5. Relation to other trajectory-balance methods

RTB is closely connected to, but distinct from, several later or adjacent trajectory-balance methods. The most direct precursor is standard TB in GFlowNets, whose motivation was improved long-range credit assignment relative to flow matching and detailed balance. TB imposes a single global constraint on a complete sampled trajectory and was empirically shown to improve convergence, diversity of generated samples, and robustness to long action sequences and large action spaces (Malkin et al., 2022).

Not every trajectory-balance method built around a reference model is RTB. The paper “Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training” does not introduce or discuss Relative Trajectory Balance. Its method, TBA, uses standard TB (implemented through a batch-estimated “VarGrad variant of trajectory balance”) inside an asynchronous actor–learner system with replay buffers, and its key systems claim is that TB is off-policy. The paper contains no objective named RTB, no derivation of an RTB loss, and no comparison between RTB and standard TB (Bartoldson et al., 24 Mar 2025).

A conceptually close but terminologically distinct development appears in “Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion LLMs.” That paper does not mention RTB by name, but its TraFL objective defines a reward-tilted target relative to a frozen reference trajectory distribution and minimizes a squared residual against that reference-anchored target. The paper presents TraFL as a response to “trajectory locking,” a failure mode in diffusion-language-model post-training in which sampled reward-driven updates over-concentrate probability mass on a narrow set of denoising paths (Ahmadi et al., 13 May 2026). A plausible implication is that RTB-style reference-relative balance principles had already become a reusable design pattern beyond the original diffusion-posterior setting.

Another related method is Rooted absorbed Prefix Trajectory Balance (RapTB). That paper is not presented as RTB, but it introduces a root-relative auxiliary residual that cancels the global normalizer and provides dense prefix-level supervision. The paper is explicit that standard terminal TB remains the only exact balance condition whose optimum matches the reward-proportional target, while the rooted relative term is an auxiliary variance-reducing regularizer layered on top (Wang et al., 28 Feb 2026). This suggests a lineage in which “relative” constructions increasingly serve credit-assignment and normalization-cancellation roles even when full RTB is not used.

6. Terminological boundaries and recurrent misconceptions

The term “RTB” is not unique to machine learning. In cosmology, RTB denotes Ricci-trace-based gravity, the subclass $f(R,T)=R+f(T)$ of $f(R,T)$ theories, where the gravitational action depends linearly on the Ricci scalar and nontrivially on the trace $T$ of the matter energy-momentum tensor (Silva et al., 2024). This usage is entirely unrelated to Relative Trajectory Balance in generative modeling and RL.

Within generative-model post-training, another recurring misconception is to treat every TB-derived, reference-anchored, or replay-compatible method as RTB. The available papers do not support that conflation. TBA is explicitly standard TB in an asynchronous system rather than RTB (Bartoldson et al., 24 Mar 2025). TraFL is RTB-like in the sense of being reference-relative and reward-tilted, but it is not introduced as RTB and uses a diffusion-compatible sequence-level surrogate rather than exact pathwise log-probabilities (Ahmadi et al., 13 May 2026). RapTB employs a rooted relative auxiliary residual, but it retains standard TB as the exact global anchor (Wang et al., 28 Feb 2026).

A further point of debate concerns novelty. One line of work presents RTB as a recently introduced off-policy objective that is asymptotically unbiased for posterior inference in diffusion inverse problems (Scimeca et al., 12 Mar 2025). A later theoretical analysis argues that RTB is not a fundamentally new alternative to KL-regularized RL because it is exactly equivalent to Trust-PCL under a specific reward design and parameter identification (Deleu et al., 1 Sep 2025). The literature therefore supports two simultaneously true statements: RTB is a useful and practically distinctive pathwise objective for posterior and generative-model fine-tuning, and RTB can also be understood as a re-expression of pre-existing KL-regularized path-consistency machinery rather than a separate theoretical family.

In that sense, the most stable definition of RTB is not institutional or historical but structural: it is a relative trajectory-balance objective in which a learned trajectory distribution is matched, via a squared log-ratio residual, to a reward- or likelihood-tilted reference trajectory distribution, with a learned normalization term and explicit off-policy compatibility.