Policy-Gradient GFlowNet Training
- The paper demonstrates that recasting GFlowNet training as a policy-gradient problem yields theoretical convergence guarantees and improved empirical performance.
- It details how classical RL techniques such as REINFORCE and actor-critic methods replace traditional flow-matching objectives with direct optimization of policy-dependent rewards.
- The approach unifies energy-based modeling and reinforcement learning, enhancing sample efficiency, robust gradient estimation, and mode coverage in complex generative tasks.
Generative Flow Networks (GFlowNets) are a framework for learning stochastic policies that sequentially generate compositional objects such that the probability of producing a given object is proportional to a positive reward assigned to it. GFlowNet training by policy gradients refers to leveraging the principles and techniques of policy gradient reinforcement learning to optimize these generative policies, contrasting with traditional value-based objectives grounded in flow-matching or trajectory balance. Recent research demonstrates that formulating GFlowNet objectives in a policy-gradient manner enables new training algorithms, provides theoretical convergence guarantees, and yields robust empirical performance across combinatorial and scientific discovery domains (Niu et al., 12 Aug 2024).
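Concretely, writing $\mathcal{X}$ for the set of complete (terminal) objects and $Z = \sum_{x \in \mathcal{X}} R(x)$ for the corresponding partition function, the sampling target is
$$P_F^{\top}(x) = \frac{R(x)}{Z}, \qquad x \in \mathcal{X},$$
i.e., the marginal probability that the forward policy terminates at object $x$ is proportional to its reward.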
1. Reformulating GFlowNet Training as a Policy-Gradient Problem
Policy-gradient GFlowNet methods recast the core flow-balance objective—ensuring the equality of incoming and outgoing flows at each node—into an optimization of expected accumulated (policy-dependent) reward, a familiar paradigm in reinforcement learning (RL) (Niu et al., 12 Aug 2024).
For an action $a_t$ chosen in state $s_t$ (producing the transition $s_t \to s_{t+1}$), the policy-dependent rewards are defined as
$$r(s_t, a_t; \theta, \phi) = \log \frac{P_B(s_t \mid s_{t+1}; \phi)}{P_F(s_{t+1} \mid s_t; \theta)} + \mathbb{1}[s_{t+1} \in \mathcal{X}]\,\log R(s_{t+1}),$$
where $P_F$ is the forward (generative) policy, $P_B$ the backward policy, and $\theta, \phi$ their respective parameters (Niu et al., 12 Aug 2024). The cumulative expected reward under the forward policy yields the RL objective
$$J(\theta; \phi) = \mathbb{E}_{\tau \sim P_F(\cdot\,; \theta)}\left[\sum_{t=0}^{T-1} r(s_t, a_t; \theta, \phi)\right].$$
The overall training target becomes minimizing the KL divergence between the distribution of sampled trajectories under $P_F(\tau; \theta)$ and a reference backward process $P_B(\tau; \phi)$, which is equivalent (up to the constant $\log Z$) to maximizing expected accumulated reward as in RL:
$$\min_\theta\, D_{\mathrm{KL}}\!\big(P_F(\tau; \theta)\,\|\,P_B(\tau; \phi)\big) \;\Longleftrightarrow\; \max_\theta\, J(\theta; \phi).$$
Policy gradients with respect to $\theta$ can thus be applied directly to $J(\theta; \phi)$; this formally bridges GFlowNet flow consistency and RL's reward maximization (Niu et al., 12 Aug 2024).
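The per-step rewards and the resulting return can be computed directly from forward and backward transition log-probabilities along a sampled trajectory. The sketch below is a minimal illustration of the formulation above, assuming per-trajectory tensors of log-probabilities and a scalar terminal log-reward; the function names and tensor layout are illustrative, not the paper's implementation.

```python
import torch

def policy_dependent_return(fwd_logprobs: torch.Tensor,
                            bwd_logprobs: torch.Tensor,
                            log_reward: torch.Tensor) -> torch.Tensor:
    """Return of one trajectory under the policy-dependent reward.

    fwd_logprobs: log P_F(s_{t+1} | s_t; theta) for t = 0..T-1, shape (T,)
    bwd_logprobs: log P_B(s_t | s_{t+1}; phi)   for t = 0..T-1, shape (T,)
    log_reward:   log R(x) for the terminal object x, scalar tensor
    """
    # Per-step policy-dependent reward r_t = log P_B - log P_F; the terminal
    # log-reward log R(x) enters once, at the end of the trajectory.
    per_step = bwd_logprobs - fwd_logprobs
    return per_step.sum() + log_reward

def rl_objective(batch_fwd, batch_bwd, batch_log_r) -> torch.Tensor:
    """Batch-averaged return J(theta; phi); maximizing it is equivalent
    (up to the constant log Z) to minimizing KL(P_F(tau) || P_B(tau))."""
    returns = torch.stack([
        policy_dependent_return(f, b, r)
        for f, b, r in zip(batch_fwd, batch_bwd, batch_log_r)
    ])
    return returns.mean()
```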
2. Policy-Gradient Training Algorithms and Gradient Derivations
This policy-gradient view enables adopting classical stochastic policy-gradient algorithms, including REINFORCE and actor-critic variants, for GFlowNet training (Niu et al., 12 Aug 2024). The gradient of the RL-formulated GFlowNet objective is given by
$$\nabla_\theta J(\theta; \phi) = \mathbb{E}_{s \sim d_{P_F},\, a \sim P_F(\cdot \mid s; \theta)}\big[A(s, a)\,\nabla_\theta \log P_F(a \mid s; \theta)\big],$$
where $A(s, a) = Q(s, a) - V(s)$ is an advantage function constructed as the expected remaining policy-dependent return of taking action $a$ in state $s$ minus a state-value baseline, and $d_{P_F}$ represents the average state distribution over trajectories. The trajectory balance (TB) loss, widely used in the GFlowNet literature, is shown to be equivalent in gradient (up to a scaling factor) to the KL divergence term above:
$$\nabla_\theta\, \mathbb{E}_{\tau \sim P_F(\cdot\,; \theta)}\big[\mathcal{L}_{\mathrm{TB}}(\tau)\big] \;\propto\; \nabla_\theta\, D_{\mathrm{KL}}\!\big(P_F(\tau; \theta)\,\|\,P_B(\tau; \phi)\big).$$
This correspondence demonstrates that direct policy-gradient optimization of the expected cumulative policy-dependent reward is a valid replacement for classical flow-matching objectives, prescribing gradient-based learning strategies familiar from RL but adjusted for GFlowNet-specific rewards (Niu et al., 12 Aug 2024).
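A minimal REINFORCE-style sketch of this update follows, using per-trajectory returns with a batch-mean baseline as a crude advantage estimate (the actor-critic variants discussed above would instead learn a value baseline). The function signature, batching, and optimizer handling are illustrative assumptions.

```python
import torch

def reinforce_step(fwd_logprobs, returns, optimizer) -> float:
    """One REINFORCE-style update of the forward-policy parameters theta.

    fwd_logprobs: list of tensors, each the sum of log P_F(s_{t+1}|s_t; theta)
                  along one sampled trajectory (differentiable w.r.t. theta)
    returns:      tensor of policy-dependent returns, one per trajectory
    """
    returns = returns.detach()            # score-function estimator: returns act as constants
    # Batch-mean baseline: subtracting it reduces variance without biasing the
    # gradient, since E[grad log P_F(tau)] = 0 under on-policy sampling.
    advantages = returns - returns.mean()
    logp = torch.stack(list(fwd_logprobs))
    # Maximize E[A * log P_F(tau)], i.e. minimize its negation.
    loss = -(advantages * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```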
3. Coupled Training and Backward Policy Design
A distinctive component of GFlowNet architectures is the inclusion of a backward policy, $P_B(\cdot \mid \cdot\,; \phi)$, which acts as a reference for credit assignment and regularization. The policy-gradient framework allows coupled or alternating optimization of both the forward and backward policies. Backward policy learning is formulated as minimizing a KL divergence (or another divergence) between $P_B(\tau; \phi)$ and a guided target $P_G(\tau)$:
$$\min_\phi\, D_{\mathrm{KL}}\!\big(P_B(\tau; \phi)\,\|\,P_G(\tau)\big),$$
with gradient
$$\nabla_\phi\, D_{\mathrm{KL}}\!\big(P_B(\tau; \phi)\,\|\,P_G(\tau)\big) = \mathbb{E}_{\tau \sim P_B(\cdot\,; \phi)}\!\left[\log \frac{P_B(\tau; \phi)}{P_G(\tau)}\,\nabla_\phi \log P_B(\tau; \phi)\right].$$
This procedure can be performed in alternation with forward policy updates, enabling joint adaptation and potentially accelerating convergence, improving credit assignment, and mitigating issues tied to poor backward policy specification (Niu et al., 12 Aug 2024). Theoretical analysis guarantees convergence properties under standard RL assumptions.
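A schematic of such coupled training is sketched below: forward-policy updates (e.g., the REINFORCE step above) alternate with score-function updates of the backward policy toward the guided target $P_G$. The samplers, data layout, and update schedule are illustrative assumptions rather than the paper's exact training loop.

```python
import torch

def backward_policy_step(bwd_traj_logprobs, guided_traj_logprobs, optimizer_phi) -> float:
    """Score-function update for min_phi KL(P_B(.; phi) || P_G).

    bwd_traj_logprobs:    log P_B(tau; phi) for trajectories generated by running
                          P_B backward from terminal objects (differentiable in phi), shape (N,)
    guided_traj_logprobs: log P_G(tau) for the same trajectories (detached), shape (N,)
    """
    log_ratio = (bwd_traj_logprobs - guided_traj_logprobs).detach()
    # grad_phi KL = E_{tau ~ P_B}[ log(P_B/P_G) * grad_phi log P_B(tau; phi) ]
    loss = (log_ratio * bwd_traj_logprobs).mean()
    optimizer_phi.zero_grad()
    loss.backward()
    optimizer_phi.step()
    return loss.item()

def coupled_training(sample_forward, sample_backward, forward_step,
                     optimizer_theta, optimizer_phi, n_iters: int = 1000):
    """Alternate forward (theta) and backward (phi) policy-gradient updates."""
    for _ in range(n_iters):
        fwd_logprobs, returns = sample_forward()      # trajectories from P_F
        forward_step(fwd_logprobs, returns, optimizer_theta)
        bwd_logp, guided_logp = sample_backward()     # trajectories from P_B
        backward_policy_step(bwd_logp, guided_logp, optimizer_phi)
```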
4. Theoretical Guarantees and Mode Coverage
The policy-gradient GFlowNet framework inherits theoretical convergence and performance guarantees from the RL literature. Under smoothness and unbiased-gradient assumptions, the optimization process is proven to converge, with monotonically decreasing gradient magnitudes over iterations (Niu et al., 12 Aug 2024).
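The flavor of such a guarantee is a standard first-order stationarity bound for stochastic policy gradients; a representative form (the paper's exact assumptions and constants may differ) is
$$\min_{1 \le k \le K} \mathbb{E}\big[\|\nabla_\theta J(\theta_k)\|^2\big] \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{K}}\right),$$
under smoothness of $J$, bounded variance of the stochastic gradient estimator, and suitably chosen step sizes.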
Empirically, policy-gradient variants achieve faster convergence to the target reward-matching distribution and capture more high-reward modes in challenging combinatorial tasks than classical value-based GFlowNet methods (e.g., trajectory balance (TB), detailed balance (DB)) across simulated hyper-grid, biological sequence, and Bayesian network learning problems. Trust-region approaches such as TRPO can be incorporated to further stabilize training and control step sizes by constraining the empirical KL divergence between successive policy updates. This leads to both improved sample efficiency and robustness to noise (Niu et al., 12 Aug 2024).
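As a simplified illustration of the trust-region idea, the sketch below adds an empirical KL penalty between the sampling-time policy and the current policy to an importance-weighted surrogate objective. Full TRPO instead solves a KL-constrained problem with a line search; the penalty form, coefficient, and bookkeeping here are assumptions for illustration.

```python
import torch

def kl_regularized_loss(new_logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        kl_coef: float = 0.1) -> torch.Tensor:
    """Policy-gradient surrogate penalized by an estimate of KL(pi_old || pi_new).

    new_logprobs: log P_F(a|s; theta)     under current parameters (differentiable)
    old_logprobs: log P_F(a|s; theta_old) recorded at sampling time (detached)
    advantages:   advantage estimates for the sampled state-action pairs (detached)
    """
    ratio = torch.exp(new_logprobs - old_logprobs.detach())
    surrogate = ratio * advantages.detach()            # importance-weighted objective
    # Monte-Carlo estimate of KL(pi_old || pi_new) over actions sampled from pi_old.
    kl_est = (old_logprobs.detach() - new_logprobs).mean()
    # Maximize the surrogate while discouraging large policy jumps.
    return -surrogate.mean() + kl_coef * kl_est
```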
5. Empirical Performance and Robustness
Experimental evidence supports the practical efficacy of policy-based GFlowNet training (Niu et al., 12 Aug 2024):
- On synthetic hyper-grid environments, mode accuracy and the match between the learned distribution and the true target are improved, as measured by total variation (TV) distance and Jensen-Shannon divergence (JSD); a short sketch of these two metrics follows this list.
- In molecular and biological sequence design tasks, policy-gradient–trained GFlowNets converge more rapidly and capture a higher fraction of high-reward modes compared to conventional GFlowNet training.
- The new framework is robust to proxy model errors and does not significantly rely on value estimation accuracy, in contrast to value-based TB and DB losses.
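For reference, the two divergence metrics used in these comparisons can be computed for discrete distributions as follows; this is a minimal sketch assuming array-valued probability vectors over the same support.

```python
import numpy as np

def total_variation(p: np.ndarray, q: np.ndarray) -> float:
    """TV(p, q) = 0.5 * sum_x |p(x) - q(x)| for discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def jensen_shannon(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * kl_pm + 0.5 * kl_qm
```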
The framework's robust gradient estimation (using variance-reduced returns and effective baselines) is a critical factor in empirical performance, enabling stable optimization even on high-dimensional combinatorial tasks.
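One example of such a variance-reduction device is a leave-one-out baseline computed within each sampled batch; the sketch below is illustrative and not necessarily the paper's specific estimator.

```python
import torch

def leave_one_out_advantages(returns: torch.Tensor) -> torch.Tensor:
    """Advantage of each trajectory against the mean return of the other trajectories.

    returns: tensor of shape (N,), N >= 2, one policy-dependent return per trajectory.
    The baseline for trajectory i excludes return i, so it remains independent of
    that trajectory and does not bias the score-function gradient estimator.
    """
    n = returns.shape[0]
    baselines = (returns.sum() - returns) / (n - 1)
    return returns - baselines
```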
6. Implications and Future Directions
The equivalence between flow balance in GFlowNets and entropy-regularized RL objectives suggests multiple avenues for exploration:
- Extension of RL variance reduction methods, actor-critic architectures, and intrinsic exploration strategies to further enhance GFlowNet optimization.
- Improved backward policy design (potentially guided by domain knowledge) to accelerate credit assignment and convergence in weakly supervised or high-sparsity regimes.
- Adoption of trust-region and constraint-based optimization techniques to maintain numerical stability and robustness during large-scale generative modeling (Niu et al., 12 Aug 2024).
By framing GFlowNet learning in the language of policy gradients—with policy-dependent rewards—this approach unifies previously separate perspectives from energy-based modeling, RL, and amortized variational inference, offering a flexible, theoretically grounded, and empirically efficient toolkit for combinatorial generative modeling.
Summary Table: Core Policy-Gradient GFlowNet Elements
Component | Definition/Role | Key Equation/Description
---|---|---
Policy-dependent Reward | Drives expected return for optimization | $r(s_t, a_t; \theta, \phi) = \log \tfrac{P_B(s_t \mid s_{t+1}; \phi)}{P_F(s_{t+1} \mid s_t; \theta)}$ (plus terminal $\log R(x)$)
RL-formulated Objective | Recasts flow balance as accumulated-reward maximization | $\max_\theta J(\theta; \phi) = \mathbb{E}_{\tau \sim P_F}\big[\textstyle\sum_t r(s_t, a_t; \theta, \phi)\big]$
Policy Gradient Update | Standard policy gradient with GFlowNet-specific reward | $\nabla_\theta J = \mathbb{E}\big[A(s, a)\,\nabla_\theta \log P_F(a \mid s; \theta)\big]$
Coupled Backward Training | Minimize $D_{\mathrm{KL}}(P_B \,\Vert\, P_G)$ via policy gradient | Alternating/coupled updates of forward and backward policies
Trust-region Optimization | KL-divergence constraint between policy iterates | Ensures robustness and monotonic improvement
This unified policy-gradient framework for GFlowNet training clarifies the connections between flow networks and RL and offers both theoretical guarantees and empirical advantages, especially in settings requiring efficient, robust, and diverse sampling from complex combinatorial spaces (Niu et al., 12 Aug 2024).