GFlowNet Training by Policy Gradients (2408.05885v2)

Published 12 Aug 2024 in cs.LG and stat.ML

Abstract: Generative Flow Networks (GFlowNets) have been shown effective to generate combinatorial objects with desired properties. We here propose a new GFlowNet training framework, with policy-dependent rewards, that bridges keeping flow balance of GFlowNets to optimizing the expected accumulated reward in traditional Reinforcement-Learning (RL). This enables the derivation of new policy-based GFlowNet training methods, in contrast to existing ones resembling value-based RL. It is known that the design of backward policies in GFlowNet training affects efficiency. We further develop a coupled training strategy that jointly solves GFlowNet forward policy training and backward policy design. Performance analysis is provided with a theoretical guarantee of our policy-based GFlowNet training. Experiments on both simulated and real-world datasets verify that our policy-based strategies provide advanced RL perspectives for robust gradient estimation to improve GFlowNet performance.

Summary

  • The paper introduces a novel policy-gradient formulation for GFlowNet training, linking cumulative reward maximization to KL divergence minimization.
  • It details the use of Actor-Critic and TRPO techniques to obtain more robust gradient estimates and more stable policy updates than traditional training methods.
  • The approach improves convergence speed and supports integrated backward policy guidance through enhanced reward designs and a joint training strategy.

This paper, "GFlowNet Training by Policy Gradients" (2408.05885), introduces a novel framework for training Generative Flow Networks (GFlowNets) by leveraging techniques from policy-based Reinforcement Learning (RL). GFlowNets are designed to sample combinatorial objects xx (like graphs or sequences) with probability proportional to a given non-negative reward function R(x)R(x). Traditional GFlowNet training methods, such as Trajectory Balance (TB), resemble value-based RL, focusing on satisfying flow balance equations across states in a Directed Acyclic Graph (DAG) representation of the generation process.

The core idea of this paper is to reformulate the GFlowNet training objective in a way that directly connects it to optimizing expected cumulative rewards in an RL setting, enabling the use of policy gradient methods.
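
As background (standard GFlowNet notation, stated here for reference rather than quoted from the paper), the sampling goal and the flow-balance condition that Trajectory Balance enforces can be written as:

```latex
% Sampling goal: terminal objects x are generated with probability
%   P(x) \propto R(x).
% Trajectory Balance (TB) requires, for every complete trajectory
% \tau = (s_0 \to s_1 \to \dots \to s_n = x),
\[
  Z \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta)
  \;=\;
  R(x) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1}; \phi),
\]
% and the TB loss is the squared log-ratio of the two sides of this identity.
```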

Key Contributions and Concepts:

  1. RL Formulation with Policy-Dependent Rewards:
    • The authors define novel policy-dependent reward functions. For the forward policy $P_F(\cdot \mid \cdot; \theta)$ (parameterized by $\theta$), the reward for taking action $a = (s \rightarrow s')$ is:

      $$R_F(s, a; \theta) = \log \frac{\pi_F(s, a; \theta)}{\pi_B(s', a; \phi)}$$

      where $\pi_F(s, a; \theta) = P_F(s' \mid s; \theta)$ and $\pi_B(s', a; \phi) = P_B(s \mid s'; \phi)$ is the backward policy probability (parameterized by $\phi$). A similar reward $R_B$ is defined for the backward policy.

    • Maximizing the expected cumulative reward $J_F = \mathbb{E}_{\mu(s_0)}[V_F(s_0)]$ (where $V_F$ is the value function associated with $R_F$) is shown to be equivalent, in terms of gradients, to minimizing the KL divergence between the forward trajectory distribution $P_F(\tau \mid s_0)$ and a target distribution derived from the backward policy and reward, $\widetilde{P}_B(\tau \mid s_0) = P_B(\tau \mid x) R(x) / Z$. This directly links the proposed RL objective to established GFlowNet objectives such as Trajectory Balance (TB).

  2. Policy-Based Training Strategies:
    • Based on the RL formulation, standard policy gradient algorithms can be applied:
      • Vanilla Policy Gradient (Actor-Critic): The paper uses the REINFORCE rule with advantage estimation. The advantage $A_F(s, a) = Q_F(s, a) - V_F(s)$ can be estimated using Generalized Advantage Estimation (GAE) with a parameter $\lambda \in [0, 1]$ to control the bias-variance trade-off. This contrasts with TB, which implicitly uses $\lambda = 1$ with an empirical return estimate and potentially a constant baseline. The policy-based approach uses a learned value function $\widetilde{V}_F$ as a functional baseline (the reward and GAE computation are sketched after this list).
      • Trust Region Policy Optimization (TRPO): A TRPO objective is proposed to update the policy $\pi_F$ conservatively, aiming for more stable learning by constraining the KL divergence between the old and new policies within a trust region $\zeta_F$.
  3. Guided Backward Policy Design as RL:
    • The quality of the backward policy $P_B$ influences training efficiency. The paper formulates the design of $P_B$ as another RL problem.
    • It introduces a "guided" trajectory distribution $P_G(\tau \mid x)$, which can incorporate heuristics or domain knowledge (e.g., using a replay buffer of high-reward samples).
    • A new reward $R_B^G(s', a; \phi) = \log \frac{\pi_B(s', a; \phi)}{\pi_G(s', a)}$ is defined, where $\pi_G$ is derived from $P_G$ (this reward is also included in the sketch after this list).
    • Minimizing the associated objective $J_B^G$ trains the Markovian $P_B$ to mimic the potentially non-Markovian $P_G$.
  4. Coupled Training Strategy:
    • A joint training strategy (Algorithm 1) is proposed where the forward policy $P_F$ (and total flow $Z$) is updated using $R_F$, and the backward policy $P_B$ is updated using either $R_B$ (if $P_G = P_F$) or $R_B^G$ (if guidance $P_G$ is used). This avoids separate pre-training phases for $P_B$.
    • Theoretical justification (Theorem 3.6) shows that minimizing $J_F$ and $J_B^G$ helps bound the objective $J_F^G$, which measures the discrepancy between $P_F$ and the guided distribution $P_G$.
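
For concreteness, here is a minimal PyTorch-style sketch of the per-step rewards $R_F$ and $R_B^G$ and the GAE advantage computation referenced in items 1–3 above. The function names, tensor shapes, and the omission of the terminal $\log R(x)$ / $\log Z$ handling are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch

def forward_rewards(log_pf, log_pb):
    """Per-step policy-dependent reward R_F(s, a; theta) = log(pi_F / pi_B)
    along one sampled trajectory.

    log_pf: (T,) log pi_F(s_t, a_t; theta) for the forward transitions.
    log_pb: (T,) log pi_B(s_{t+1}, a_t; phi) for the matching backward moves.
    How the terminal log R(x) and log Z terms enter the return follows from
    the paper's KL identity and is left out of this sketch.
    """
    return log_pf - log_pb

def guided_backward_rewards(log_pb, log_pg):
    """Per-step guided reward R_B^G = log(pi_B / pi_G), used to train the
    Markovian backward policy to mimic a guided distribution P_G."""
    return log_pb - log_pg

def gae_advantages(rewards, values, lam=0.99, gamma=1.0):
    """Generalized Advantage Estimation over one finite trajectory.

    values: (T+1,) learned baseline V(s_0), ..., V(s_T); the terminal value
    is conventionally zero. lam in [0, 1] trades bias against variance;
    lam = 1 recovers empirical returns minus the baseline.
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def actor_loss(log_probs, adv):
    """REINFORCE-with-baseline surrogate: advantages are detached so the
    gradient flows only through the policy log-probabilities."""
    return -(adv.detach() * log_probs).mean()

# Toy usage with dummy log-probabilities for a length-3 trajectory; in
# practice log_pf, log_pb, and the values come from the parameterized
# networks described under "Implementation Details" below.
log_pf = torch.tensor([-0.69, -0.92, -0.11], requires_grad=True)
log_pb = torch.tensor([-0.51, -0.69, 0.0])
values = torch.tensor([0.10, 0.05, 0.02, 0.0])
adv = gae_advantages(forward_rewards(log_pf, log_pb), values)
loss = actor_loss(log_pf, adv)  # minimize with any optimizer
```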

Implementation Details:

  • Networks: Requires parameterizing the forward policy $\pi_F(\cdot \mid \cdot; \theta_F)$, the backward policy $\pi_B(\cdot \mid \cdot; \phi)$, the total flow estimator $Z(\theta_Z)$, and potentially value functions $\widetilde{V}_F(\cdot; \eta_F)$ and $\widetilde{V}_B(\cdot; \eta_B)$.
  • Policy Gradient Updates: Standard actor-critic updates are used.
    • Compute policy-dependent rewards ($R_F$, $R_B$, $R_B^G$) based on current network outputs.
    • Estimate advantages (e.g., using GAE with parameter $\lambda$).
    • Update policy parameters ($\theta_F$, $\phi$) using policy gradients.
    • Update value function parameters ($\eta_F$, $\eta_B$) by minimizing the squared error between estimated returns and value predictions.
    • Update the total flow parameter $\theta_Z$ using its gradient component (derived from $\nabla J_F$ or $\nabla \mathcal{L}_{TB}$).
  • TRPO Implementation: Requires computing the Fisher-vector product (using conjugate gradient) to solve the TRPO constraint efficiently and performing a line search for the step size (a generic conjugate-gradient routine is sketched after this list).
  • Guided Policy: The guided distribution $P_G$ needs to be defined based on the specific task, potentially using heuristics or replay buffers as shown in the experiments (e.g., penalizing early termination in low-reward areas for hypergrids, or using mean rewards from a replay buffer for sequence design).
  • Hyperparameters: Key hyperparameters include learning rates, the GAE parameter $\lambda$ (an ablation shows $\lambda \approx 0.99$ works well), and the TRPO trust region size $\zeta_F$.
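
The TRPO variant's natural-gradient step is typically obtained by solving a linear system with the Fisher matrix via conjugate gradient, as noted in the TRPO bullet above. Below is a generic, self-contained conjugate-gradient routine of that kind; it is a standard textbook procedure given as an illustrative assumption rather than the paper's code, and `fvp` stands for any user-supplied Fisher-vector-product function.

```python
import torch

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b, where fvp(v) returns the Fisher-vector
    product F v (in TRPO, F is the Fisher matrix of the policy, usually
    computed via Hessian-vector products of the KL between the old and new
    policies). Plain conjugate gradient without preconditioning.
    """
    x = torch.zeros_like(b)
    r = b.clone()
    p = b.clone()
    rs_old = r.dot(r)
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rs_old / (p.dot(Ap) + 1e-12)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check with an explicit SPD matrix standing in for the Fisher matrix.
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
b = torch.tensor([1.0, 0.0])
x = conjugate_gradient(lambda v: A @ v, b)
# A @ x is now close to b; in TRPO the resulting direction is rescaled to
# respect the trust region zeta_F, followed by a backtracking line search.
```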

Experimental Findings:

  • Experiments on hyper-grids, biological/molecular sequence generation (SIX6, QM9, PHO4, sEH), and Bayesian Network structure learning show that the proposed policy-based methods (RL-U, RL-B, RL-T, RL-G) often achieve faster convergence than value-based methods (DB, TB, Sub-TB, TB-TS).
  • Policy-based methods, particularly RL-T (TRPO) and RL-G (Guided), often reach final performance (measured by $D_{TV}$, $D_{JSD}$, and accuracy) comparable to or better than strong baselines such as TB-U.
  • The results suggest that policy-based methods provide more robust gradient estimates, especially in high-dimensional or sparse reward settings.
  • The ablation study on $\lambda$ confirms its role in balancing bias and variance, with values slightly below 1 often outperforming $\lambda = 1$ (which corresponds more closely to the empirical returns used in TB).

Practical Implications:

  • Provides an alternative way to train GFlowNets that is potentially more robust and faster than existing methods such as TB.
  • Leverages well-established RL algorithms (Actor-Critic, GAE, TRPO) and their implementations.
  • The framework allows systematically incorporating guidance or heuristics into the backward policy training (RL-G) for improved efficiency.
  • Offers flexibility in choosing the advantage estimator (via $\lambda$) to tune the bias-variance trade-off for gradient estimation based on the specific problem.
  • The TRPO variant (RL-T) offers a way to achieve more stable training updates.

In summary, the paper establishes a strong connection between GFlowNet training and policy-based RL, leading to new training algorithms that demonstrate empirical benefits in terms of convergence speed and robustness, particularly by leveraging advanced gradient estimation and stabilization techniques from the RL literature.
