Towards Understanding and Improving GFlowNet Training (2305.07170v1)

Published 11 May 2023 in cs.LG

Abstract: Generative flow networks (GFlowNets) are a family of algorithms that learn a generative policy to sample discrete objects $x$ with non-negative reward $R(x)$. Learning objectives guarantee the GFlowNet samples $x$ from the target distribution $p^*(x) \propto R(x)$ when loss is globally minimized over all states or trajectories, but it is unclear how well they perform with practical limits on training resources. We introduce an efficient evaluation strategy to compare the learned sampling distribution to the target reward distribution. As flows can be underdetermined given training data, we clarify the importance of learned flows to generalization and matching $p^*(x)$ in practice. We investigate how to learn better flows, and propose (i) prioritized replay training of high-reward $x$, (ii) relative edge flow policy parametrization, and (iii) a novel guided trajectory balance objective, and show how it can solve a substructure credit assignment problem. We substantially improve sample efficiency on biochemical design tasks.

Citations (43)

Summary

  • The paper's main contribution is introducing novel training techniques that improve how well GFlowNets match target reward distributions.
  • It employs prioritized replay, relative edge flow parametrization, and guided trajectory balance to enhance sample efficiency and credit assignment.
  • Empirical results on biochemical design tasks demonstrate up to 9-fold speedups and improved convergence, highlighting practical impact.

Generative Flow Networks (GFlowNets) are machine learning algorithms designed to learn a policy for generating discrete objects $x$ such that the probability of generating an object is proportional to a given non-negative reward function $R(x)$, i.e., $p(x) \propto R(x)$. This objective is useful for tasks like discovering high-reward objects in large, discrete spaces, such as designing molecules or biological sequences.

The paper "Towards Understanding and Improving GFlowNet Training" (Towards Understanding and Improving GFlowNet Training, 2023) investigates practical challenges in training GFlowNets and proposes new methods to improve their performance, particularly focusing on their ability to accurately match the target distribution p(x)R(x)p^*(x) \propto R(x) under realistic computational constraints.

Core Challenges in GFlowNet Training:

The paper identifies several key challenges:

  1. Underfitting the Target Distribution: While GFlowNet objectives theoretically guarantee matching $p^*(x)$ at the global minimum, achieving this in practice on large state spaces is difficult. The paper shows empirically that trained GFlowNets often underfit, particularly by oversampling low-reward objects, even after extensive training. This is observed by comparing the mean reward of samples from the GFlowNet policy to the target mean reward: the sampled mean is typically much lower than the target at the start of training and struggles to catch up.
  2. Sample Efficiency: The process of improving the sampling distribution to match $p^*(x)$ can be very slow, requiring a large number of training steps and samples.
  3. Underdetermined Flows and Generalization: In many interesting generative Markov Decision Processes (MDPs) used by GFlowNets (e.g., generating graphs or sequences with multiple insertion points), there are exponentially many trajectories that can lead to the same final object $x$. This means the total "flow" assigned to $x$, proportional to $R(x)$, can be distributed among these trajectories in many ways. The paper highlights that how this flow is distributed across intermediate states and edges (the "learned flow distribution") significantly impacts the GFlowNet's ability to generalize to unseen parts of the state space and thus match the target distribution effectively.
  4. Substructure Credit Assignment: For compositional objects (where the reward depends on substructures), a key challenge is assigning credit (or flow) to the intermediate states corresponding to high-reward substructures. Existing objectives like Trajectory Balance (TB) and Maximum Entropy (MaxEnt) can under-credit these important substructures, especially if they are not frequently visited on training trajectories, hindering generalization.

Evaluation Strategy:

To provide a more precise empirical understanding of GFlowNet training dynamics, the paper uses benchmarks where the space of terminal objects $\mathcal{X}$ is enumerable. This allows for evaluating how well the GFlowNet's learned sampling distribution $p_\theta(x)$ matches the target $p^*(x)$ by comparing properties of sampled rewards to those of the target distribution. The primary metrics used are the Anderson-Darling statistic (a goodness-of-fit test) and, more intuitively, the relative error between the mean reward of GFlowNet samples $\mathbb{E}_{p_\theta(x)}[R(x)]$ and the target mean reward $\mathbb{E}_{p^*(x)}[R(x)]$.
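As a concrete illustration of this metric, below is a minimal sketch (in Python with NumPy; the function and variable names are our own, not from the paper's code) of the relative mean-reward error when the terminal space is small enough to enumerate:

```python
import numpy as np

def relative_mean_reward_error(sampled_rewards, all_rewards):
    """Relative error between the mean reward under the learned policy and
    the target mean E_{p*}[R(x)], computable when X is enumerable."""
    all_rewards = np.asarray(all_rewards, dtype=float)
    p_star = all_rewards / all_rewards.sum()             # p*(x) proportional to R(x)
    target_mean = float((p_star * all_rewards).sum())    # E_{p*}[R(x)]
    sample_mean = float(np.mean(sampled_rewards))        # estimate of E_{p_theta}[R(x)]
    return abs(sample_mean - target_mean) / target_mean

# Hypothetical example: a tiny enumerable space and rewards of policy samples.
all_R = np.array([0.1, 0.2, 1.0, 5.0])
policy_samples_R = np.array([0.1, 0.2, 0.2, 1.0, 0.1])
print(relative_mean_reward_error(policy_samples_R, all_R))
```

An analogous comparison against the enumerated target can be made with a goodness-of-fit statistic such as Anderson-Darling, as the paper does.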

Proposed Improvements:

The paper introduces three main modifications to standard GFlowNet training to address the identified challenges (illustrative sketches of each appear after the list):

  1. Reward-Prioritized Replay Training (PRT): This strategy aims to combat the issue of oversampling low-reward objects. In addition to training on newly sampled data, it incorporates a replay buffer of previously seen objects $x$. When sampling from the replay buffer, it prioritizes high-reward objects. This ensures the GFlowNet sees and learns from high-reward examples more frequently, helping it to shift probability mass towards them faster. The implementation involves sampling a portion of the replay batch from high-reward percentiles and the rest from low-reward percentiles.
  2. Relative Edge Flow Parametrization (SSR): This is an alternative neural network parametrization for the forward policy $P_F$. Instead of predicting action probabilities directly from a state $s$ (the common "SA" parametrization), SSR parameterizes relative edge flows $f_\theta(s, s')$ using a neural network that takes both the current state $s$ and a potential next state $s'$ as input. $P_F(s'|s)$ is then computed as $f_\theta(s, s') / \sum_{c \in \mathrm{children}(s)} f_\theta(s, c)$. The intuition is that allowing the policy network to "see" the potential child state $s'$ might lead to better generalization by helping the GFlowNet learn which types of transitions or substructures are associated with favorable flow.
  3. Guided Trajectory Balance (GTB): This is a novel learning objective designed to give more flexible control over how flow is distributed across trajectories, specifically to address the substructure credit assignment problem. GTB introduces a "guide distribution" $p(\tau_{\rightarrow x})$ over trajectories ending in $x$. The objective encourages the GFlowNet's learned trajectory distribution $P_F(\tau_{\rightarrow x})$ to be proportional to $p(\tau_{\rightarrow x}) R(x)$.
    • Substructure Guide: A specific instance of GTB is proposed for compositional reward functions. This guide favors trajectories that pass through substructures observed in the set of previously seen high-reward objects ($X$). The guide's transition probability $p(s_{t+1}|s_t, x, X)$ is defined based on a score $\phi(s'|x, X)$ for child states $s'$, where $\phi$ is the average reward of objects $x' \in X$ (excluding the current target $x$) that contain $s'$ as a substructure. This guides the GFlowNet to attribute credit to empirically relevant substructures.
    • Implementation: GTB can be implemented by training a backward policy $P_B$ to approximate the guide $p(\tau_{\rightarrow x})$ and then using this learned $P_B$ in the standard TB objective. Alternatively, one can use a mixed objective that combines the learned $P_B$ term from standard TB with the guide term $p(\tau_{\rightarrow x})$. The paper's experiments primarily used a mixing strategy.
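A minimal sketch of reward-prioritized replay (PRT) as described in item 1, assuming an in-memory buffer of (object, reward) pairs; the class name, percentile cutoff, and high/low split are illustrative choices, not the paper's exact settings:

```python
import numpy as np

class RewardPrioritizedReplay:
    """Stores previously seen objects and draws part of each replay batch
    from the high-reward end of the buffer, the rest from the remainder."""

    def __init__(self, high_percentile=90.0, high_fraction=0.5, seed=0):
        self.objects, self.rewards = [], []
        self.high_percentile = high_percentile
        self.high_fraction = high_fraction
        self.rng = np.random.default_rng(seed)

    def add(self, x, reward):
        self.objects.append(x)
        self.rewards.append(float(reward))

    def sample(self, batch_size):
        rewards = np.array(self.rewards)
        cutoff = np.percentile(rewards, self.high_percentile)
        high_idx = np.flatnonzero(rewards >= cutoff)
        low_idx = np.flatnonzero(rewards < cutoff)
        if len(low_idx) == 0:          # tiny buffer or uniform rewards
            low_idx = high_idx
        n_high = int(round(self.high_fraction * batch_size))
        idx = np.concatenate([
            self.rng.choice(high_idx, size=n_high, replace=True),
            self.rng.choice(low_idx, size=batch_size - n_high, replace=True),
        ])
        return [self.objects[i] for i in idx], rewards[idx]
```

Replay batches drawn this way are mixed with freshly sampled trajectories during training, so high-reward objects are revisited more often.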
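A sketch of the SSR parametrization from item 2, assuming fixed-size vector encodings of states; the module name and dimensions are placeholders (a PyTorch sketch, not the paper's code):

```python
import torch
import torch.nn as nn

class RelativeEdgeFlowPolicy(nn.Module):
    """Scores (state, candidate child) pairs and normalizes over the children
    of the current state, giving P_F(s'|s) = f(s, s') / sum_c f(s, c)."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.edge_flow = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, children):
        # state:    (state_dim,) encoding of the current state s
        # children: (num_children, state_dim) encodings of candidate next states s'
        pairs = torch.cat([state.expand(children.shape[0], -1), children], dim=-1)
        log_f = self.edge_flow(pairs).squeeze(-1)   # log f_theta(s, s'), one per child
        return torch.log_softmax(log_f, dim=-1)     # log P_F(s'|s) over children
```

Parameterizing the edge flows in log space and normalizing with a softmax is a numerically stable way to realize the $f_\theta(s, s') / \sum_c f_\theta(s, c)$ normalization.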
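And a sketch of a guided trajectory-balance-style loss for a single trajectory, corresponding to item 3. The interpolation between the learned backward policy and the guide is an assumption for illustration; the paper's exact mixing strategy may differ:

```python
import torch

def guided_tb_loss(log_Z, log_pf_steps, log_pb_steps, log_guide_steps,
                   log_reward, mix=0.5):
    """TB-style squared log-ratio loss for one trajectory, with the backward
    term interpolated between the learned P_B and the guide p(tau -> x).

    log_pf_steps:    per-step log P_F(s_{t+1}|s_t) along the trajectory
    log_pb_steps:    per-step log P_B(s_t|s_{t+1}) from the learned backward policy
    log_guide_steps: per-step log transition probabilities of the guide
    """
    log_backward = mix * log_guide_steps.sum() + (1.0 - mix) * log_pb_steps.sum()
    forward = log_Z + log_pf_steps.sum()
    backward = log_reward + log_backward
    return (forward - backward) ** 2

# Hypothetical usage with dummy per-step log-probabilities.
loss = guided_tb_loss(
    log_Z=torch.tensor(1.0, requires_grad=True),
    log_pf_steps=torch.log(torch.tensor([0.5, 0.25, 0.5])),
    log_pb_steps=torch.log(torch.tensor([1.0, 0.5, 0.5])),
    log_guide_steps=torch.log(torch.tensor([1.0, 0.8, 0.6])),
    log_reward=torch.log(torch.tensor(2.0)),
)
```

With `mix=0` this reduces to the standard TB objective with a learned $P_B$; with `mix=1` the backward term is fully replaced by the guide.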

Theoretical Insights on Credit Assignment:

The paper provides theoretical analysis in a simplified "sequence prepend/append" MDP setting to illustrate the credit assignment behavior of different objectives. It shows that:

  • MaxEnt: Tends to distribute flow as widely as possible, assigning minimal credit to a shared, "important" substructure $s^*$ that contributes equally to the reward of two different objects $x, x'$. The proportion of flow through $s^*$ decreases as the number of alternative substructures grows.
  • TB: Can exhibit a "rich-get-richer" dynamic akin to a Pólya urn process, where trajectories with initially higher flow are more likely to be sampled and reinforced. This can lead to flow being concentrated on arbitrary trajectories rather than necessarily those containing important substructures, potentially reducing credit assigned to $s^*$.
  • Substructure GTB: By construction, the substructure guide explicitly directs flow through the important shared substructure $s^*$, theoretically assigning maximal credit to it at convergence in the simplified setting.

Experimental Results and Practical Implications:

Experiments on several biochemical design tasks (Bag, SIX6, PHO4, QM9, sEH) demonstrate the effectiveness of the proposed methods:

  • Improved Sample Efficiency: Models incorporating PRT, SSR, and/or Substructure GTB consistently reached the target mean reward significantly faster than baseline TB and MaxEnt (or converged to a higher mean reward when the target was not reached). For sEH, the speedup was up to 9-fold.
  • Better Convergence: On tasks like SIX6, where baselines struggled to match the target mean within a reasonable training budget, the Substructure GTB model was the only one to successfully reach the target mean.
  • Impact of Proposals: PRT generally improved sample efficiency by focusing training on high-reward data. SSR showed a meaningful impact by altering generalization behavior and improving sample efficiency on several tasks. Substructure GTB had the biggest impact on tasks like SIX6 and sEH, confirming that explicitly guiding credit assignment towards empirically relevant substructures can be crucial.
  • TB vs. MaxEnt: Empirically, TB and MaxEnt showed similar training behavior and convergence in the tested environments.
  • MDP Choice Matters: The choice of MDP impacts the number of trajectories per object and thus the potential for flow underdetermination and the relevance of substructure credit assignment. For SIX6, using a prepend/append MDP (with multiple trajectories per string) combined with substructure guidance outperformed training on an autoregressive MDP (with one trajectory per string), suggesting that non-autoregressive MDPs can be beneficial when combined with appropriate training strategies.
  • Mode Discovery and Diversity: Improved performance in matching the target distribution also led to better mode discovery (finding more high-reward objects), while maintaining comparable sample diversity among top-reward samples.

In summary, this paper provides valuable insights into the practical challenges of training GFlowNets to match target reward distributions. It highlights underfitting and credit assignment as major hurdles and proposes practical solutions—Prioritized Replay Training, Relative Edge Flow Parametrization, and Substructure-Guided Trajectory Balance—that demonstrate substantial improvements in sample efficiency and convergence on real-world inspired tasks. The work also underscores the importance of considering learned flow distributions and compositional structure in object generation MDPs.