Trajectory Balance in GFlowNets

Updated 17 March 2026

Trajectory Balance is a global objective in GFlowNets that enforces consistency across complete trajectories for immediate, dense credit assignment.
It improves convergence by propagating error signals in one shot, overcoming the limitations of local, stepwise credit assignment methods.
The method underpins state-of-the-art advancements in LLM post-training, combinatorial design, and molecule synthesis by enhancing mode coverage and sampling diversity.

A trajectory balance (TB) objective is a global, path-wise training principle for Generative Flow Networks (GFlowNets), designed to improve credit assignment and sampling efficiency when generating discrete objects, such as sequences, graphs, or sets, with probability proportional to an unnormalized reward. Unlike strictly local or temporal-difference analogues (e.g., flow matching or detailed balance), trajectory balance imposes a consistency constraint on entire sampled trajectories, so that a single loss evaluation propagates error to every action along a path. As a result, TB yields faster convergence, robustness to long horizons and large action spaces, and improved diversity in generation. The trajectory balance mechanism now serves as a foundational method in GFlowNets and underlies numerous state-of-the-art advancements in RL-based LLM post-training, combinatorial design, and amortized posterior sampling.

1. Formal Definition and Core Mechanism

Given a directed acyclic graph (DAG) of states and actions, a GFlowNet samples objects $x$ via complete trajectories $\tau=(s_0, s_1, \dots, s_n=x)$, with forward transition probabilities $P_F(s_{t}\mid s_{t-1};\theta)$, backward transition probabilities $P_B(s_{t-1}\mid s_t;\theta)$, a terminal reward function $R(x)$, and a learnable normalizer $Z_\theta$. The trajectory balance constraint asserts

$Z_\theta \prod_{t=1}^{n} P_F(s_{t} \mid s_{t-1}; \theta) = R(x) \prod_{t=1}^n P_B(s_{t-1} \mid s_t; \theta)$

for any complete trajectory ending at $x$. Deviations are penalized by a squared log-residual:

$\mathcal{L}_\mathrm{TB}(\tau) = \left[ \log\frac{Z_\theta \prod_{t=1}^n P_F(s_t\mid s_{t-1};\theta)}{R(x) \prod_{t=1}^n P_B(s_{t-1} \mid s_t; \theta)} \right]^2$

Training is typically performed by sampling full trajectories under the current (possibly tempered) policy $\pi_\theta$, computing gradients of the TB loss, and taking stochastic gradient steps (Malkin et al., 2022).

This global constraint provides immediate, dense credit assignment to all decisions along a sampled path, as the loss couples all actions through the trajectory-level probability ratio.
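As a minimal sketch of the loss above, the snippet below evaluates the squared log-residual for a single trajectory from per-step log-probabilities. The function name `tb_loss` and the toy numbers are illustrative assumptions, not taken from any particular implementation; in practice the same quantity would be computed on tensors inside an autodiff framework.

```python
import math

def tb_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """Squared log-residual of the TB constraint for one complete trajectory.

    All arguments are in log space; the *_steps lists hold one entry per
    transition of the trajectory (hypothetical interface for illustration).
    """
    residual = log_z + sum(log_pf_steps) - log_reward - sum(log_pb_steps)
    return residual ** 2

# Toy two-step trajectory s0 -> s1 -> x (numbers chosen by hand).
log_pf = [math.log(0.5), math.log(0.4)]   # forward step probabilities
log_pb = [math.log(1.0), math.log(0.5)]   # backward step probabilities
log_r = math.log(2.0)                     # terminal reward R(x) = 2

# Z * 0.5 * 0.4 == 2 * 1.0 * 0.5 when Z = 5, so the constraint holds.
balanced = tb_loss(math.log(5.0), log_pf, log_pb, log_r)
# With Z = 1 the constraint is violated and the residual is log(0.2).
unbalanced = tb_loss(math.log(1.0), log_pf, log_pb, log_r)
```

Because the residual is a single scalar that depends on every step of the path, its gradient touches all forward and backward transition parameters at once.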

2. Theoretical Properties and Optimality Guarantees

TB comes with a fixed-point guarantee: if a model $\{P_F, P_B, Z_\theta\}$ achieves zero TB loss on every complete trajectory, then the flow into each terminal state satisfies $F_\theta(x) = R(x)$ for every $x$, so sampling with $P_F$ produces objects with probability exactly proportional to reward. The argument sums the TB identity over all trajectories leading to $x$ and uses the fact that $P_B$ is a normalized distribution over parents, so the backward path probabilities into $x$ sum to one; zero loss on all trajectories therefore solves the reward-matching constraint globally (Malkin et al., 2022).
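The summation argument can be written out in a few lines. This is a sketch rather than a full proof; the notation $P_F^{\top}(x)$ for the probability that forward sampling terminates at $x$ is an assumption introduced here, and $\tau \to x$ ranges over all complete trajectories ending at $x$.

```latex
\begin{align*}
\sum_{\tau \to x} Z_\theta \prod_{t=1}^{n} P_F(s_t \mid s_{t-1}; \theta)
  &= \sum_{\tau \to x} R(x) \prod_{t=1}^{n} P_B(s_{t-1} \mid s_t; \theta) \\
Z_\theta \, P_F^{\top}(x)
  &= R(x) \underbrace{\sum_{\tau \to x} \prod_{t=1}^{n} P_B(s_{t-1} \mid s_t; \theta)}_{=\,1} \\
P_F^{\top}(x) &= \frac{R(x)}{Z_\theta},
  \qquad Z_\theta = \sum_{x'} R(x')
\end{align*}
```

The final identity for $Z_\theta$ follows by summing $P_F^{\top}(x)$ over all terminal states, which must give one.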

3. Comparison with Local GFlowNet Objectives

Three principal families of GFlowNet objectives have been developed:

Objective	Constraint Domain	Credit Propagation
Flow Matching	Per-node & terminal	Stepwise (slow for long chains)
Detailed Balance	Per-edge	Stepwise (slow for long chains)
Trajectory Balance	Entire trajectories	One-shot (global)

Both flow matching and detailed balance enforce only local conservation or reversibility constraints, so reward information must propagate sequentially, one transition at a time, which is inefficient for long-horizon tasks or sparse rewards. TB, in contrast, assigns credit across the full path in a single loss evaluation, which accelerates convergence and improves mode coverage in practice (Malkin et al., 2022).
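One way to see the relationship between the per-edge and per-trajectory constraints is that the detailed balance log-residuals along a path telescope into the TB log-residual once the initial flow is identified with $Z_\theta$ and the terminal flow with $R(x)$. The sketch below checks this numerically; the helper names and the intermediate state-flow values are hypothetical and chosen arbitrarily.

```python
import math

def db_residuals(log_flows, log_pf, log_pb):
    """Per-edge detailed balance log-residuals.

    log_flows[t] is the log state flow of s_t, so it has one more entry
    than the per-transition lists log_pf and log_pb.
    """
    return [
        log_flows[t] + log_pf[t] - log_flows[t + 1] - log_pb[t]
        for t in range(len(log_pf))
    ]

def tb_residual(log_z, log_pf, log_pb, log_reward):
    """Trajectory-level log-residual (the quantity squared in the TB loss)."""
    return log_z + sum(log_pf) - log_reward - sum(log_pb)

log_z, log_reward = math.log(3.0), math.log(2.0)
# Intermediate flows are arbitrary: they cancel when the edge terms are summed.
log_flows = [log_z, math.log(1.7), math.log(0.9), log_reward]
log_pf = [math.log(p) for p in (0.5, 0.25, 0.8)]
log_pb = [math.log(p) for p in (1.0, 0.5, 0.5)]

edge_terms = db_residuals(log_flows, log_pf, log_pb)
path_term = tb_residual(log_z, log_pf, log_pb, log_reward)
```

Here `sum(edge_terms)` equals `path_term` up to floating-point error. Detailed balance has to drive each edge term to zero separately, which requires learning the intermediate flows, whereas TB constrains only their sum and needs no state-flow estimates.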

4. Practical Training Procedures and Implementation

Standard TB training samples full trajectories under the current forward policy, optionally with temperature or off-policy correction, computes the TB loss per trajectory, and updates the parameters via stochastic gradients. Large-scale applications employ mini-batched or asynchronous variants, importance sampling, and empirical mean estimators (e.g., VarGrad for partition function estimation) (Bartoldson et al., 24 Mar 2025). TB is robust to off-policy sampling and can effectively leverage experience replay by computing the loss for any buffered or asynchronously-generated trajectory with minimal architectural overhead.
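The empirical-mean estimation of the partition function mentioned above can be sketched as follows. This is one reading of a VarGrad-style estimator, under the assumption that each trajectory in a batch yields its own implied value of $\log Z$ and that the batch mean replaces the learned normalizer; the function name and inputs are hypothetical, and gradient handling is omitted.

```python
import math

def vargrad_tb_loss(log_pf_sums, log_pb_sums, log_rewards):
    """Batch TB loss with log Z replaced by an empirical mean (assumed form).

    Each argument is a list with one entry per trajectory: the summed forward
    log-probabilities, summed backward log-probabilities, and log reward.
    """
    # Value of log Z that would make each trajectory's TB residual zero.
    implied = [
        lr + lpb - lpf
        for lpf, lpb, lr in zip(log_pf_sums, log_pb_sums, log_rewards)
    ]
    log_z_hat = sum(implied) / len(implied)
    # Mean squared residual around the batch estimate, i.e. a variance.
    loss = sum((log_z_hat - v) ** 2 for v in implied) / len(implied)
    return log_z_hat, loss

# Tree-structured generation (e.g. autoregressive text): each state has one
# parent, so every backward log-probability is zero.
log_pf_sums = [math.log(0.2), math.log(0.1), math.log(0.4)]
log_pb_sums = [0.0, 0.0, 0.0]
log_rewards = [math.log(1.0), math.log(0.5), math.log(2.0)]

log_z_hat, loss = vargrad_tb_loss(log_pf_sums, log_pb_sums, log_rewards)
```

In this toy batch every trajectory implies $Z = 5$, so the estimate is $\log 5$ and the loss is zero up to rounding. Since the loss only needs log-probabilities under the current parameters, the trajectories could equally come from a replay buffer or an asynchronous sampler.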

5. Extensions and Generalizations

Several recent works extend or adapt TB to new domains and challenges:

Rooted Absorbed Prefix TB (RapTB): Introduces dense prefix-level supervision with absorbed suffix rewards to provide lower-variance, partial-credit signals to early trajectory prefixes, while maintaining TB as a hard global anchor. RapTB propagates terminal credit back to prefixes, alleviating prefix collapse and length bias in compositional generation (Wang et al., 28 Feb 2026).
Submodular Replay: Combines TB objectives with submodular trajectory selection, promoting reward, diversity, and length-coverage to prevent mode collapse in off-policy replay (Wang et al., 28 Feb 2026).
Divergent Trajectory Balance (DTB): Applies the TB objective to an exploration GFlowNet, but selectively suppresses probability mass on already well-covered high-reward regions (over-allocated sets), ensuring exploration is focused on underexplored, high-value states. This forms the basis of Adaptive Complementary Exploration (ACE) (Dall'Antonia et al., 19 Feb 2026).
Hybrid-Policy Sub-Trajectory Balance: Generalizes TB to sub-trajectories (segments) and allows hybridization with off-policy (expert) samples, improving local policy learning and addressing the long-horizon credit assignment problem in generative optimizers (Guan et al., 1 Nov 2025).
Relative Trajectory Balance (RTB): Adapts TB to the fine-tuning of pretrained policies under KL-regularized RL, and is formally shown to be equivalent to Trust-PCL. RTB underlies reward-augmented autoregressive or diffusion model training (Deleu et al., 1 Sep 2025).
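To make the last item concrete, the sketch below writes a relative trajectory balance residual in the form commonly used for KL-regularized fine-tuning: the backward term of TB is replaced by the log-probability of the trajectory under a frozen prior policy, and the reward enters as $R(x)/\beta$. This exact form, the temperature parameter `beta`, and all names are assumptions for illustration; the text above does not spell out the RTB loss.

```python
import math

def rtb_loss(log_z, log_post_steps, log_prior_steps, reward, beta):
    """Squared RTB log-residual for one trajectory (assumed form).

    log_post_steps: per-step log-probs under the policy being fine-tuned.
    log_prior_steps: per-step log-probs under the frozen pretrained policy.
    reward / beta: log of the unnormalized tilt applied to the prior.
    """
    residual = (
        log_z + sum(log_post_steps) - sum(log_prior_steps) - reward / beta
    )
    return residual ** 2

# Toy check: the prior assigns 0.5 * 0.5 to the trajectory, the reward is 1
# with beta = 1, and Z = 2. The tilted target probability is 0.25 * e / 2,
# factored here into steps 0.5 and 0.25 * e.
log_prior = [math.log(0.5), math.log(0.5)]
log_post = [math.log(0.5), math.log(0.25 * math.e)]
matched = rtb_loss(math.log(2.0), log_post, log_prior, reward=1.0, beta=1.0)

# With zero reward, leaving the policy equal to the prior (Z = 1) is optimal.
unchanged = rtb_loss(0.0, log_prior, log_prior, reward=0.0, beta=1.0)
```

Both residuals vanish, which matches the intuition that the fine-tuned policy should equal the prior reweighted by the exponentiated reward.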

6. Empirical Evaluation and Comparative Performance

Experiments across combinatorial design, molecule synthesis, bit-sequence generation, text/LLM post-training, and optimization tasks consistently demonstrate that TB yields:

Faster convergence in $L^1$ or TV error of the terminal sampling distribution relative to true reward-proportional targets, especially in high-dimensional and long-horizon tasks.
Better mode coverage (distributional diversity and completeness; lower clustering in sequence or molecular similarity metrics).
Robustness to longer trajectories and larger action spaces; e.g., TB in molecule design attains higher reward-sample correlation and increased diversity at up to $5\times$ faster runtime than flow matching baselines (Malkin et al., 2022).
Gains in LLM preference tuning and post-training: TB-based asynchronous systems maintain diversity and a favorable win-rate versus KL trade-off, outperforming DPO in both performance and sample diversity, with up to $4\times$ wall-clock speedup from decoupling sampling from training (Bartoldson et al., 24 Mar 2025).
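The $L^1$ and total variation (TV) errors referred to in the first item can be computed from samples when the set of terminal objects is small enough to enumerate. The following is a minimal sketch under that assumption; the function names and the toy reward table are hypothetical.

```python
from collections import Counter

def reward_proportional_target(rewards):
    """Normalize a dict of rewards {object: R(x)} into target probabilities."""
    total = sum(rewards.values())
    return {x: r / total for x, r in rewards.items()}

def l1_and_tv_error(samples, rewards):
    """L1 and TV distance between empirical sample frequencies and the target."""
    target = reward_proportional_target(rewards)
    counts = Counter(samples)
    n = len(samples)
    support = set(target) | set(counts)
    l1 = sum(abs(counts.get(x, 0) / n - target.get(x, 0.0)) for x in support)
    return l1, 0.5 * l1  # TV distance is half the L1 distance

rewards = {"a": 1.0, "b": 2.0, "c": 1.0}

# Samples whose frequencies match the target exactly.
l1_good, tv_good = l1_and_tv_error(["a", "b", "b", "c"], rewards)
# A collapsed sampler that only ever returns the highest-reward object.
l1_bad, tv_bad = l1_and_tv_error(["b", "b", "b", "b"], rewards)
```

The collapsed sampler has high reward on every sample yet a TV error of 0.5, which is why distributional error is reported alongside reward when assessing mode coverage.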

7. Contemporary Challenges and Ongoing Developments

Persistent challenges for TB-based GFlowNets include weak credit assignment to early subtrajectories ("prefix collapse"), instability under replay-induced distribution shift, and efficient off-policy scaling. Innovations such as absorbed prefix supervision (RapTB), submodular replay, and explicitly complementary explorer GFlowNets (ACE) address several of these failure modes, yielding state-of-the-art validity, diversity, and optimization tradeoffs on text and molecule generation benchmarks (Wang et al., 28 Feb 2026; Dall'Antonia et al., 19 Feb 2026). The formal equivalence of relative TB to Trust-PCL, a KL-regularized RL method, further connects TB to soft Bellman consistency and enables transfer of theoretical and algorithmic advances between the GFlowNet and RL literatures (Deleu et al., 1 Sep 2025).

References

"Trajectory balance: Improved credit assignment in GFlowNets" (Malkin et al., 2022)
"Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training" (Wang et al., 28 Feb 2026)
"Avoid What You Know: Divergent Trajectory Balance for GFlowNets" (Dall'Antonia et al., 19 Feb 2026)
"Learning an Efficient Optimizer via Hybrid-Policy Sub-Trajectory Balance" (Guan et al., 1 Nov 2025)
"Relative Trajectory Balance is equivalent to Trust-PCL" (Deleu et al., 1 Sep 2025)
"Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training" (Bartoldson et al., 24 Mar 2025)