Revisiting Non-Acyclic GFlowNets in Discrete Environments (2502.07735v2)

Published 11 Feb 2025 in cs.LG and stat.ML

Abstract: Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects from a given probability distribution, potentially known up to a normalizing constant. Instead of working in the object space, GFlowNets proceed by sampling trajectories in an appropriately constructed directed acyclic graph environment, greatly relying on the acyclicity of the graph. In our paper, we revisit the theory that relaxes the acyclicity assumption and present a simpler theoretical framework for non-acyclic GFlowNets in discrete environments. Moreover, we provide various novel theoretical insights related to training with fixed backward policies, the nature of flow functions, and connections between entropy-regularized RL and non-acyclic GFlowNets, which naturally generalize the respective concepts and theoretical results from the acyclic setting. In addition, we experimentally re-examine the concept of loss stability in non-acyclic GFlowNet training, as well as validate our own theoretical findings.

Summary

  • The paper presents a simplified theoretical framework for applying GFlowNets in discrete, cyclic environments by focusing on the backward policy.
  • It redefines flows as expected visit counts and establishes detailed balance conditions that guarantee finite-length trajectories under basic reachability assumptions.
  • The study proposes state flow regularization to jointly minimize trajectory length and improve reward matching, validated by experiments on hypergrid and permutation tasks.

This paper, "Revisiting Non-Acyclic GFlowNets in Discrete Environments" (2502.07735), provides a simplified theoretical framework for applying Generative Flow Networks (GFlowNets) to discrete state spaces where the transition graph may contain cycles. While standard GFlowNet theory [bengio2023gflownet] relies heavily on the environment being a directed acyclic graph (DAG), non-acyclic environments are relevant for tasks like discrete control, standard reinforcement learning problems, or generating objects with intrinsic symmetries (like permutations). The paper revisits and simplifies the theory proposed by previous work [brunswic2024theory] and offers new insights into training non-acyclic GFlowNets.

In standard acyclic GFlowNets, objects are generated by sampling trajectories in a DAG from an initial state $s_0$ to a terminal state $s_f$. The probability of sampling an object $x$ (which corresponds to a terminal state or the end of a trajectory) is designed to be proportional to a given reward function $\mathcal{R}(x)$. The generative process is defined by a forward policy $\PF(s' \mid s)$.
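
A minimal sketch of this acyclic generative process (the toy DAG, state names, and policy values are illustrative assumptions, not the paper's environments):

```python
# Toy DAG: s0 -> {a, b} -> sf, with a hand-picked forward policy P_F(s' | s).
import random

P_F_TOY = {
    "s0": {"a": 0.7, "b": 0.3},
    "a":  {"sf": 1.0},
    "b":  {"sf": 1.0},
}

def sample_forward(start="s0", sink="sf"):
    """Roll the forward policy until the sink state s_f is reached."""
    state, trajectory = start, [start]
    while state != sink:
        children, probs = zip(*P_F_TOY[state].items())
        state = random.choices(children, weights=probs)[0]
        trajectory.append(state)
    return trajectory

print(sample_forward())  # e.g. ['s0', 'a', 'sf']
```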

The core challenge in non-acyclic environments is that cycles can lead to trajectories of infinite length or arbitrarily large expected length, which complicates the definition and training of GFlowNets. The paper adapts the theoretical foundation by focusing on the backward policy $\PB(s \mid s')$ as the primary means to define a probability distribution over trajectories. If the backward policy is strictly positive for all edges and the graph structure satisfies basic reachability assumptions (a path from $s_0$ to any state, and from any state to $s_f$), the backward policy induces a well-defined probability distribution over finite trajectories starting at $s_0$ and ending at $s_f$ (Lemma 3.1). The expected length of trajectories sampled from this backward process is finite.
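
A minimal sketch of this setup (not the paper's code): a strictly positive backward policy on a small cyclic graph. Despite the cycle between $a$ and $b$, the backward walk from $s_f$ reaches $s_0$ in finitely many steps with probability one.

```python
# For each state s', P_B_TOY[s'] is a distribution over its parents s.
import random

P_B_TOY = {
    "sf": {"a": 0.5, "b": 0.5},
    "a":  {"s0": 0.6, "b": 0.4},   # a is reachable from s0 and from b
    "b":  {"s0": 0.6, "a": 0.4},   # b is reachable from s0 and from a
}

def sample_backward(sink="sf", source="s0"):
    """Walk backwards from the sink until the source is reached."""
    state, reversed_traj = sink, [sink]
    while state != source:
        parents, probs = zip(*P_B_TOY[state].items())
        state = random.choices(parents, weights=probs)[0]
        reversed_traj.append(state)
    return list(reversed(reversed_traj))  # forward-ordered: s0 -> ... -> sf

print(sample_backward())  # e.g. ['s0', 'b', 'a', 'sf']
```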

Crucially, the paper redefines state and edge flows. Unlike in acyclic graphs, where the flow $\mathcal{F}(s \to s')$ can be interpreted as the probability that a sampled trajectory traverses the edge $(s \to s')$, in non-acyclic graphs this intuition breaks down due to potential revisits. The paper proposes defining flows as proportional to the expected number of visits to states or edges by a trajectory sampled from the backward process (Definition 3.2). This definition of flows, scaled by $\mathcal{F}(s_f)$ (the total flow into the sink state), satisfies the flow matching conditions $\mathcal{F}(s) = \sum_{s' \in \text{out}(s)} \mathcal{F}(s \to s') = \sum_{s'' \in \text{in}(s)} \mathcal{F}(s'' \to s)$ and the detailed balance-like conditions $\mathcal{F}(s \to s') = \mathcal{F}(s') \PB(s \mid s')$ (Proposition 3.3). Furthermore, for any strictly positive backward policy, there exists a unique forward policy $\PF(s' \mid s) = \frac{\mathcal{F}(s') \PB(s \mid s')}{\mathcal{F}(s)}$ that satisfies the detailed balance conditions $\mathcal{F}(s)\PF(s' \mid s) = \mathcal{F}(s')\PB(s \mid s')$ and induces the same trajectory distribution as $\PB$ (Proposition 3.4).
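
To make Definition 3.2 and Proposition 3.4 concrete, here is a Monte Carlo sketch on the toy cyclic graph above (the scaling $\mathcal{F}(s_f) = 1$ and the estimator itself are illustrative choices, not the paper's procedure):

```python
# Estimate flows as expected visit counts under the backward process, then
# recover the induced forward policy P_F(s' | s) = F(s -> s') / F(s), which
# by detailed balance equals F(s') P_B(s | s') / F(s).
from collections import Counter

def estimate_flows(num_samples=100_000):
    state_visits, edge_visits = Counter(), Counter()
    for _ in range(num_samples):
        traj = sample_backward()                  # from the previous sketch
        state_visits.update(traj)
        edge_visits.update(zip(traj, traj[1:]))
    # Scale per trajectory, so F(sf) = 1 (each trajectory visits sf exactly once).
    F_state = {s: c / num_samples for s, c in state_visits.items()}
    F_edge = {e: c / num_samples for e, c in edge_visits.items()}
    return F_state, F_edge

F_state, F_edge = estimate_flows()
P_F_induced = {
    s: {s2: F_edge[(s1, s2)] / F_state[s] for (s1, s2) in F_edge if s1 == s}
    for s in F_state if s != "sf"
}
print(P_F_induced)  # each row sums to ~1, up to Monte Carlo error
```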

The practical goal of GFlowNets is to learn a forward policy $\PF$ that samples terminal states (objects) $x \in \mathcal{X}$ with probability proportional to $\mathcal{R}(x)$. This reward matching condition is equivalent to ensuring that the flow into the sink state from any terminal state $x$ is proportional to $\mathcal{R}(x)$, i.e., $\mathcal{F}(x \to s_f) = \mathcal{R}(x)$ (Proposition 3.6).
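
Under the visit-count definition of flows this is easy to see: every trajectory traverses exactly one terminal edge $x \to s_f$, and does so exactly once, so (a sketch of the argument, not the paper's proof)

$$
P(x) \;=\; \frac{\mathcal{F}(x \to s_f)}{\sum_{x' \in \mathcal{X}} \mathcal{F}(x' \to s_f)} \;=\; \frac{\mathcal{F}(x \to s_f)}{\mathcal{F}(s_f)},
$$

which is proportional to $\mathcal{R}(x)$ whenever $\mathcal{F}(x \to s_f) \propto \mathcal{R}(x)$, in particular when $\mathcal{F}(x \to s_f) = \mathcal{R}(x)$.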

The paper discusses two main training scenarios:

  1. Fixed Backward Policy: If the backward policy $\PB$ is fixed (e.g., hand-designed to satisfy reward matching at $s_f$ and be near-uniform otherwise), the theoretical results guarantee that there is a unique forward policy $\PF$ that matches $\PB$ and satisfies reward matching, and the induced trajectories have finite expected length (Corollary 3.7). In this setting, standard GFlowNet losses developed for acyclic graphs (like the Detailed Balance (DB) loss [bengio2023gflownet]) can be used to train the forward policy (and optionally flows). The concept of "loss stability" introduced in [brunswic2024theory] to prevent flows from growing uncontrollably along cycles does not impact the final result, as a unique finite-length solution exists. The main challenge is that a hand-picked fixed $\PB$ might result in a very large, albeit finite, expected trajectory length, making forward sampling inefficient.
  2. Trainable Backward Policy: Training $\PB$ allows the GFlowNet to potentially find a solution with a smaller expected trajectory length. However, standard losses like DB can lead to unstable training where the expected trajectory length grows unbounded if not addressed [brunswic2024theory]. The paper provides a key insight: the expected trajectory length is equal to the normalized total state flow $\frac{1}{\mathcal{F}(s_f)}\sum_{s \in \mathcal{S} \setminus \{s_0, s_f\}} \mathcal{F}(s)$ (Proposition 3.8). Therefore, minimizing expected trajectory length is equivalent to minimizing the total state flow; a numerical check of this identity on the toy example appears after this list.
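
As a quick numerical illustration of Proposition 3.8, reusing the toy $\PB$ and helpers from the sketches above (counting a trajectory's length as its number of visits to intermediate states is an assumption about the paper's exact convention):

```python
# Two independent Monte Carlo estimates of the same quantity: the mean number
# of intermediate-state visits per trajectory, and the normalized total
# intermediate state flow (1 / F(sf)) * sum_{s != s0, sf} F(s).
num_samples = 100_000

lengths = [
    sum(1 for s in sample_backward() if s not in ("s0", "sf"))
    for _ in range(num_samples)
]
mean_length = sum(lengths) / num_samples

F_state, _ = estimate_flows(num_samples)       # F(sf) = 1 under this scaling
total_flow = sum(f for s, f in F_state.items() if s not in ("s0", "sf"))

print(f"mean length ~ {mean_length:.3f}, total state flow ~ {total_flow:.3f}")
```

For this toy $\PB$, both estimates concentrate around $1/0.6 \approx 1.67$, the expected number of visits to $\{a, b\}$ before the backward walk reaches $s_0$.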

Based on this equivalence, the paper proposes adding a state flow regularization term, $\lambda \mathcal{F}(s)$, to the standard loss function, such as the DB loss (Equation 3.10). This encourages the model to find a solution that minimizes the total flow while satisfying the detailed balance conditions and reward matching.
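
A sketch of one plausible per-transition form of this objective (the exact shape of Equation 3.10 is not reproduced here; a squared log-scale DB residual plus a $\lambda \mathcal{F}(s)$ penalty is an assumption consistent with the description above):

```python
import torch

def regularized_db_loss(log_F_s, log_F_s_next, log_pf, log_pb, lam=0.01):
    """Detailed balance residual in log scale plus a state flow penalty.

    log_F_s, log_F_s_next : predicted log-flows log F(s), log F(s')
    log_pf                : log P_F(s' | s) under the forward policy
    log_pb                : log P_B(s | s') under the backward policy
    lam                   : regularization strength, trading off expected
                            trajectory length against sampling accuracy
    """
    db_residual = (log_F_s + log_pf) - (log_F_s_next + log_pb)
    flow_penalty = lam * torch.exp(log_F_s)   # lambda * F(s)
    return db_residual.pow(2) + flow_penalty
```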

The paper also explores the "Scaling Hypothesis": the practical stability issues might stem from the scale at which flow errors are computed. The standard DB loss operates on log-flows ($\Delta \log \mathcal{F}$), while the stable DB (SDB) loss from [brunswic2024theory] and other flow-based losses operate on flow differences ($\Delta \mathcal{F}$). The hypothesis is that $\Delta \mathcal{F}$-scale losses are biased towards smaller flows because their gradients diminish rapidly when flows are small, making it harder to increase them. This intrinsic bias might provide a form of stability.
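
A short calculation behind this intuition (a sketch with the usual log-flow parameterization $\mathcal{F} = e^{\theta}$ and a fixed positive target $c$; the paper's losses are more involved):

$$
\frac{\partial}{\partial \theta}\big(\mathcal{F} - c\big)^2 = 2\,(\mathcal{F} - c)\,\mathcal{F} \;\longrightarrow\; 0 \quad \text{as } \mathcal{F} \to 0,
\qquad
\frac{\partial}{\partial \theta}\big(\log \mathcal{F} - \log c\big)^2 = 2\,(\log \mathcal{F} - \log c),
$$

so a $\Delta \mathcal{F}$-scale loss can push flows down easily but has vanishing gradients for pushing small flows back up, whereas the $\Delta \log \mathcal{F}$-scale gradient does not shrink with the flow's magnitude.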

From an implementation perspective, GFlowNets in this setting are typically parameterized by neural networks predicting log-flows ($\log \mathcal{F}_\theta(s)$), forward policy logits, and backward policy logits. Training involves optimizing these parameters to satisfy the detailed balance equations and reward matching, potentially with the proposed state flow regularization. The training objective involves sampling transitions or trajectories and applying the loss, similar to reinforcement learning training procedures.
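
A minimal parameterization sketch (the shared trunk, layer sizes, and dense state encoding are illustrative assumptions; masking of invalid actions according to the environment's transition graph is omitted):

```python
import torch
import torch.nn as nn

class GFlowNetModel(nn.Module):
    """Shared trunk with heads for log F_theta(s) and both policy logits."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.log_flow_head = nn.Linear(hidden, 1)       # log F_theta(s)
        self.pf_head = nn.Linear(hidden, num_actions)   # forward policy logits
        self.pb_head = nn.Linear(hidden, num_actions)   # backward policy logits

    def forward(self, states: torch.Tensor):
        h = self.trunk(states)
        log_flow = self.log_flow_head(h).squeeze(-1)
        log_pf = torch.log_softmax(self.pf_head(h), dim=-1)
        log_pb = torch.log_softmax(self.pb_head(h), dim=-1)
        return log_flow, log_pf, log_pb
```

Sampled transitions $(s, s')$ would then be scored by such a model and plugged into a per-transition objective like the regularized DB loss sketched above.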

The paper generalizes the equivalence between GFlowNet training (specifically with a fixed backward policy and reward matching) and entropy-regularized reinforcement learning [tiapkin2024generative] to the non-acyclic setting (Theorem 3.11). This connection suggests that algorithms and insights from entropy-regularized RL could be beneficial for training non-acyclic GFlowNets.
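
For orientation, in the acyclic case the correspondence of [tiapkin2024generative] can be sketched roughly as follows (the precise non-acyclic formulation is the one given in Theorem 3.11): define an MDP reward

$$
r(s \to s') = \log \PB(s \mid s') \ \ \text{for } s' \neq s_f,
\qquad
r(x \to s_f) = \log \mathcal{R}(x);
$$

then the optimal policy of the entropy-regularized MDP with this reward (regularization coefficient $1$) coincides with the reward-matching GFlowNet forward policy $\PF$ for the fixed $\PB$.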

Experimental results on non-acyclic hypergrid and permutation environments validate the theoretical findings:

  • Training with a fixed $\PB$ works with standard (unstable in the sense of [brunswic2024theory]) losses, confirming Corollary 3.7.
  • When $\PB$ is trainable, standard losses in the $\Delta \log \mathcal{F}$ scale can lead to unbounded expected trajectory lengths without regularization, especially on larger environments (Figure 3 in the Appendix).
  • Losses in the $\Delta \mathcal{F}$ scale (like the SDB variant) tend to yield smaller expected trajectory lengths, supporting the scaling hypothesis, but may struggle to accurately match the reward distribution on more complex tasks.
  • Adding state flow regularization to $\Delta \log \mathcal{F}$ losses (both standard DB and SDB in log scale) successfully controls the expected trajectory length while achieving better reward distribution matching compared to $\Delta \mathcal{F}$ losses (Figure 2, Table 1). The regularization strength allows tuning the trade-off between low expected length and high sampling accuracy.

In conclusion, the paper provides a clear theoretical foundation for discrete non-acyclic GFlowNets, highlights the critical role of the backward policy and expected visit counts in defining flows, and demonstrates that controlling the total state flow via regularization is a practical way to manage expected trajectory length during training with a learnable backward policy, even when using standard "unstable" losses. The empirical evidence suggests that $\Delta \log \mathcal{F}$ losses with regularization are a promising approach for accurate sampling with controlled trajectory length in non-acyclic environments. Future work could explore alternative losses, leverage RL techniques more deeply, and address specific non-acyclic environment structures.