GFlowNet Training by Policy Gradients (2408.05885v2)

Published 12 Aug 2024 in cs.LG and stat.ML

Abstract: Generative Flow Networks (GFlowNets) have been shown effective to generate combinatorial objects with desired properties. We here propose a new GFlowNet training framework, with policy-dependent rewards, that bridges keeping flow balance of GFlowNets to optimizing the expected accumulated reward in traditional Reinforcement-Learning (RL). This enables the derivation of new policy-based GFlowNet training methods, in contrast to existing ones resembling value-based RL. It is known that the design of backward policies in GFlowNet training affects efficiency. We further develop a coupled training strategy that jointly solves GFlowNet forward policy training and backward policy design. Performance analysis is provided with a theoretical guarantee of our policy-based GFlowNet training. Experiments on both simulated and real-world datasets verify that our policy-based strategies provide advanced RL perspectives for robust gradient estimation to improve GFlowNet performance.

Summary

  • The paper introduces a novel policy-gradient formulation for GFlowNet training, linking cumulative reward maximization to KL divergence minimization.
  • It details the use of Actor-Critic and TRPO techniques to obtain more robust gradient estimates and more stable policy updates than traditional training methods.
  • The approach improves convergence speed and supports integrated backward policy guidance through enhanced reward designs and a joint training strategy.

This paper, "GFlowNet Training by Policy Gradients" (2408.05885), introduces a novel framework for training Generative Flow Networks (GFlowNets) by leveraging techniques from policy-based Reinforcement Learning (RL). GFlowNets are designed to sample combinatorial objects xx (like graphs or sequences) with probability proportional to a given non-negative reward function R(x)R(x). Traditional GFlowNet training methods, such as Trajectory Balance (TB), resemble value-based RL, focusing on satisfying flow balance equations across states in a Directed Acyclic Graph (DAG) representation of the generation process.

The core idea of this paper is to reformulate the GFlowNet training objective in a way that directly connects it to optimizing expected cumulative rewards in an RL setting, enabling the use of policy gradient methods.
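
As background (standard GFlowNet notation, stated here for reference rather than quoted from the paper), the sampling goal and the flow-balance condition that Trajectory Balance enforces can be written as:

```latex
% Sampling goal: terminal objects x are generated with probability
%   P(x) \propto R(x).
% Trajectory Balance (TB) requires, for every complete trajectory
% \tau = (s_0 \to s_1 \to \dots \to s_n = x),
\[
  Z \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta)
  \;=\;
  R(x) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1}; \phi),
\]
% and the TB loss is the squared log-ratio of the two sides of this identity.
```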

Key Contributions and Concepts:

  1. RL Formulation with Policy-Dependent Rewards:
    • The authors define novel policy-dependent reward functions. For the forward policy $P_F(\cdot \mid \cdot; \theta)$ (parameterized by $\theta$), the reward for taking action $a = (s \rightarrow s')$ is:

      $$R_F(s, a; \theta) = \log \frac{\pi_F(s, a; \theta)}{\pi_B(s', a; \phi)}$$

      where $\pi_F(s, a; \theta) = P_F(s' \mid s; \theta)$ and $\pi_B(s', a; \phi) = P_B(s \mid s'; \phi)$ is the backward policy probability (parameterized by $\phi$). A similar reward $R_B$ is defined for the backward policy.

    • Maximizing the expected cumulative reward $J_F = \mathbb{E}_{\mu(s_0)}[V_F(s_0)]$ (where $V_F$ is the value function associated with $R_F$) is shown to be equivalent, in terms of gradients, to minimizing the KL divergence between the forward trajectory distribution $P_F(\tau \mid s_0)$ and a target distribution derived from the backward policy and reward, $\widetilde{P}_B(\tau \mid s_0) = P_B(\tau \mid x) R(x) / Z$. This directly links the proposed RL objective to established GFlowNet objectives such as Trajectory Balance (TB).

  2. Policy-Based Training Strategies:
    • Based on the RL formulation, standard policy gradient algorithms can be applied:
      • Vanilla Policy Gradient (Actor-Critic): The paper uses the REINFORCE rule with advantage estimation. The advantage $A_F(s, a) = Q_F(s, a) - V_F(s)$ can be estimated using Generalized Advantage Estimation (GAE) with a parameter $\lambda \in [0, 1]$ to control the bias-variance trade-off. This contrasts with TB, which implicitly uses $\lambda = 1$ with an empirical return estimate and potentially a constant baseline. The policy-based approach uses a learned value function $\widetilde{V}_F$ as a functional baseline (the reward and GAE computation are sketched after this list).
      • Trust Region Policy Optimization (TRPO): A TRPO objective is proposed to update the policy $\pi_F$ conservatively, aiming for more stable learning by constraining the KL divergence between the old and new policies within a trust region $\zeta_F$.
  3. Guided Backward Policy Design as RL:
    • The quality of the backward policy $P_B$ influences training efficiency. The paper formulates the design of $P_B$ as another RL problem.
    • It introduces a "guided" trajectory distribution $P_G(\tau \mid x)$, which can incorporate heuristics or domain knowledge (e.g., using a replay buffer of high-reward samples).
    • A new reward $R_B^G(s', a; \phi) = \log \frac{\pi_B(s', a; \phi)}{\pi_G(s', a)}$ is defined, where $\pi_G$ is derived from $P_G$ (this reward is also included in the sketch after this list).
    • Minimizing the associated objective $J_B^G$ trains the Markovian $P_B$ to mimic the potentially non-Markovian $P_G$.
  4. Coupled Training Strategy:
    • A joint training strategy (Algorithm 1) is proposed where the forward policy $P_F$ (and total flow $Z$) is updated using $R_F$, and the backward policy $P_B$ is updated using either $R_B$ (if $P_G = P_F$) or $R_B^G$ (if guidance $P_G$ is used). This avoids separate pre-training phases for $P_B$.
    • Theoretical justification (Theorem 3.6) shows that minimizing $J_F$ and $J_B^G$ helps bound the objective $J_F^G$, which measures the discrepancy between $P_F$ and the guided distribution $P_G$.
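
For concreteness, here is a minimal PyTorch-style sketch of the per-step rewards $R_F$ and $R_B^G$ and the GAE advantage computation referenced in items 1–3 above. The function names, tensor shapes, and the omission of the terminal $\log R(x)$ / $\log Z$ handling are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch

def forward_rewards(log_pf, log_pb):
    """Per-step policy-dependent reward R_F(s, a; theta) = log(pi_F / pi_B)
    along one sampled trajectory.

    log_pf: (T,) log pi_F(s_t, a_t; theta) for the forward transitions.
    log_pb: (T,) log pi_B(s_{t+1}, a_t; phi) for the matching backward moves.
    How the terminal log R(x) and log Z terms enter the return follows from
    the paper's KL identity and is left out of this sketch.
    """
    return log_pf - log_pb

def guided_backward_rewards(log_pb, log_pg):
    """Per-step guided reward R_B^G = log(pi_B / pi_G), used to train the
    Markovian backward policy to mimic a guided distribution P_G."""
    return log_pb - log_pg

def gae_advantages(rewards, values, lam=0.99, gamma=1.0):
    """Generalized Advantage Estimation over one finite trajectory.

    values: (T+1,) learned baseline V(s_0), ..., V(s_T); the terminal value
    is conventionally zero. lam in [0, 1] trades bias against variance;
    lam = 1 recovers empirical returns minus the baseline.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def actor_loss(log_probs, adv):
    """REINFORCE-with-baseline surrogate: advantages are detached so the
    gradient flows only through the policy log-probabilities."""
    return -(adv.detach() * log_probs).mean()

# Toy usage with dummy log-probabilities for a length-3 trajectory; in
# practice log_pf, log_pb, and the values come from the parameterized
# networks described under "Implementation Details" below.
log_pf = torch.tensor([-0.69, -0.92, -0.11], requires_grad=True)
log_pb = torch.tensor([-0.51, -0.69, 0.0])
values = torch.tensor([0.10, 0.05, 0.02, 0.0])
adv = gae_advantages(forward_rewards(log_pf, log_pb), values)
loss = actor_loss(log_pf, adv)  # minimize with any optimizer
```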

Implementation Details:

  • Networks: Requires parameterizing the forward policy $\pi_F(\cdot \mid \cdot; \theta_F)$, the backward policy $\pi_B(\cdot \mid \cdot; \phi)$, the total flow estimator $Z(\theta_Z)$, and potentially value functions $\widetilde{V}_F(\cdot; \eta_F)$ and $\widetilde{V}_B(\cdot; \eta_B)$.
  • Policy Gradient Updates: Standard actor-critic updates are used.
    • Compute policy-dependent rewards ($R_F$, $R_B$, $R_B^G$) based on current network outputs.
    • Estimate advantages (e.g., using GAE with parameter $\lambda$).
    • Update policy parameters ($\theta_F$, $\phi$) using policy gradients.
    • Update value function parameters ($\eta_F$, $\eta_B$) by minimizing the squared error between estimated returns and value predictions.
    • Update the total flow parameter $\theta_Z$ using its gradient component (derived from $\nabla J_F$ or $\nabla \mathcal{L}_{TB}$).
  • TRPO Implementation: Requires computing the Fisher-vector product (using conjugate gradient) to solve the TRPO constraint efficiently and performing a line search for the step size (a generic conjugate-gradient routine is sketched after this list).
  • Guided Policy: The guided distribution $P_G$ needs to be defined based on the specific task, potentially using heuristics or replay buffers as shown in the experiments (e.g., penalizing early termination in low-reward areas for hypergrids, or using mean rewards from a replay buffer for sequence design).
  • Hyperparameters: Key hyperparameters include learning rates, the GAE parameter $\lambda$ (an ablation shows $\lambda \approx 0.99$ works well), and the TRPO trust region size $\zeta_F$.
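
The TRPO variant's natural-gradient step is typically obtained by solving a linear system with the Fisher matrix via conjugate gradient, as noted in the TRPO bullet above. Below is a generic, self-contained conjugate-gradient routine of that kind; it is a standard textbook procedure given as an illustrative assumption rather than the paper's code, and `fvp` stands for any user-supplied Fisher-vector-product function.

```python
import torch

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b, where fvp(v) returns the Fisher-vector
    product F v (in TRPO, F is the Fisher matrix of the policy, usually
    computed via Hessian-vector products of the KL between the old and new
    policies). Plain conjugate gradient without preconditioning.
    """
    x = torch.zeros_like(b)
    r = b.clone()
    p = b.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rs_old / (p.dot(Ap) + 1e-12)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check with an explicit SPD matrix standing in for the Fisher matrix.
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
b = torch.tensor([1.0, 0.0])
x = conjugate_gradient(lambda v: A @ v, b)
# A @ x is now close to b; in TRPO the resulting direction is rescaled to
# respect the trust region zeta_F, followed by a backtracking line search.
```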

Experimental Findings:

  • Experiments on hyper-grids, biological/molecular sequence generation (SIX6, QM9, PHO4, sEH), and Bayesian Network structure learning show that the proposed policy-based methods (RL-U, RL-B, RL-T, RL-G) often achieve faster convergence than value-based methods (DB, TB, Sub-TB, TB-TS).
  • Policy-based methods, particularly RL-T (TRPO) and RL-G (Guided), often reach final performance (measured by $D_{TV}$, $D_{JSD}$, and accuracy) comparable to or better than strong baselines such as TB-U.
  • The results suggest that policy-based methods provide more robust gradient estimates, especially in high-dimensional or sparse reward settings.
  • The ablation study on $\lambda$ confirms its role in balancing bias and variance, with values slightly below 1 often outperforming $\lambda = 1$ (which corresponds more closely to the empirical returns used in TB).

Practical Implications:

  • Provides an alternative way to train GFlowNets that is potentially more robust and faster than existing methods such as TB.
  • Leverages well-established RL algorithms (Actor-Critic, GAE, TRPO) and their implementations.
  • The framework allows systematically incorporating guidance or heuristics into the backward policy training (RL-G) for improved efficiency.
  • Offers flexibility in choosing the advantage estimator (via $\lambda$) to tune the bias-variance trade-off for gradient estimation based on the specific problem.
  • The TRPO variant (RL-T) offers a way to achieve more stable training updates.

In summary, the paper establishes a strong connection between GFlowNet training and policy-based RL, leading to new training algorithms that demonstrate empirical benefits in terms of convergence speed and robustness, particularly by leveraging advanced gradient estimation and stabilization techniques from the RL literature.
