- The paper presents a simplified theoretical framework for applying GFlowNets in discrete, cyclic environments by focusing on the backward policy.
- It redefines flows as expected visit counts and establishes detailed balance conditions that guarantee finite-length trajectories under basic reachability assumptions.
- The study proposes state flow regularization to jointly minimize trajectory length and improve reward matching, validated by experiments on hypergrid and permutation tasks.
This paper, "Revisiting Non-Acyclic GFlowNets in Discrete Environments" (2502.07735), provides a simplified theoretical framework for applying Generative Flow Networks (GFlowNets) to discrete state spaces where the transition graph may contain cycles. While standard GFlowNet theory [bengio2023gflownet] relies heavily on the environment being a directed acyclic graph (DAG), non-acyclic environments are relevant for tasks like discrete control, standard reinforcement learning problems, or generating objects with intrinsic symmetries (like permutations). The paper revisits and simplifies the theory proposed by previous work [brunswic2024theory] and offers new insights into training non-acyclic GFlowNets.
In standard acyclic GFlowNets, objects are generated by sampling trajectories in a DAG from an initial state $s_0$ to a sink state $s_f$. The probability of sampling an object x (which corresponds to a terminal state, i.e., the last state visited before $s_f$) is designed to be proportional to a given reward function R(x). The generative process is defined by a forward policy $\PF(s' \mid s)$.
The core challenge in non-acyclic environments is that cycles can lead to trajectories of infinite length or arbitrarily large expected length, which complicates the definition and training of GFlowNets. The paper adapts the theoretical foundation by focusing on the backward policy $\PB(s \mid s')$ as the primary means to define a probability distribution over trajectories. If the backward policy is strictly positive for all edges and the graph structure satisfies basic reachability assumptions (a path from $s_0$ to any state, and from any state to $s_f$), the backward policy induces a well-defined probability distribution over finite trajectories starting at $s_0$ and ending at $s_f$ (Lemma 3.1). The expected length of trajectories sampled from this backward process is finite.
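To make the backward process concrete, here is a minimal Python sketch that samples a finite trajectory by walking backward from $s_f$ to $s_0$ under a strictly positive (here uniform) backward policy. The toy cyclic graph is an illustrative assumption, not taken from the paper.

```python
import random

# Hypothetical toy graph with a cycle (1 <-> 2); edges point forward,
# and parents[s] lists the predecessors s'' with an edge s'' -> s.
parents = {
    "sf": [1, 2],   # terminal states 1 and 2 can transition to the sink
    1: [0, 2],
    2: [0, 1],
    0: [],          # s0 has no parents, so the backward walk stops here
}

def sample_backward_trajectory(s0=0, sf="sf"):
    """Walk backward from sf until s0 is reached; under the reachability
    assumptions of Lemma 3.1 this terminates with probability 1."""
    traj = [sf]
    while traj[-1] != s0:
        # Strictly positive backward policy: uniform over the parents.
        traj.append(random.choice(parents[traj[-1]]))
    return list(reversed(traj))  # report it forward: s0 -> ... -> sf

print(sample_backward_trajectory())
```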
Crucially, the paper redefines state and edge flows. Unlike in acyclic graphs, where the flow $\mathcal{F}(s \to s')$ can be interpreted as the probability that a sampled trajectory traverses the edge $(s \to s')$, in non-acyclic graphs this intuition breaks down due to potential revisits. The paper proposes defining flows proportional to the expected number of visits to states or edges by a trajectory sampled from the backward process (Definition 3.2). This definition of flows, scaled by $\mathcal{F}(s_f)$ (the total flow into the sink state), satisfies the flow matching conditions $\mathcal{F}(s) = \sum_{s' \in \mathrm{out}(s)} \mathcal{F}(s \to s') = \sum_{s'' \in \mathrm{in}(s)} \mathcal{F}(s'' \to s)$ and detailed-balance-like conditions $\mathcal{F}(s \to s') = \mathcal{F}(s') \PB(s \mid s')$ (Proposition 3.3). Furthermore, for any strictly positive backward policy, there exists a unique forward policy $\PF(s' \mid s) = \frac{\mathcal{F}(s') \PB(s \mid s')}{\mathcal{F}(s)}$ that satisfies the detailed balance conditions $\mathcal{F}(s)\PF(s' \mid s) = \mathcal{F}(s')\PB(s \mid s')$ and induces the same trajectory distribution as $\PB$ (Proposition 3.4).
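Continuing the toy example above, the sketch below estimates state and edge flows as expected visit counts under the backward process (Definition 3.2), checks the relation $\mathcal{F}(s \to s') = \mathcal{F}(s') \PB(s \mid s')$, and reads off the induced forward policy $\PF(s' \mid s) = \mathcal{F}(s \to s')/\mathcal{F}(s)$. The graph and the uniform backward policy are assumptions for illustration.

```python
import random
from collections import Counter

# Same hypothetical toy graph as above (uniform backward policy).
parents = {"sf": [1, 2], 1: [0, 2], 2: [0, 1], 0: []}

def sample_backward(s0=0, sf="sf"):
    traj = [sf]
    while traj[-1] != s0:
        traj.append(random.choice(parents[traj[-1]]))
    return list(reversed(traj))

# Monte Carlo estimate of flows as expected visit counts (Definition 3.2),
# normalized so that F(sf) = 1.
n, state_visits, edge_visits = 50_000, Counter(), Counter()
for _ in range(n):
    traj = sample_backward()
    state_visits.update(traj)
    edge_visits.update(zip(traj[:-1], traj[1:]))
F = {s: c / n for s, c in state_visits.items()}
F_edge = {e: c / n for e, c in edge_visits.items()}

# Check F(s -> s') = F(s') PB(s | s') and read off PF(s' | s) = F(s -> s')/F(s).
for (s, s_next), f in sorted(F_edge.items(), key=str):
    pb = 1.0 / len(parents[s_next])            # uniform PB(s | s')
    print(f"{s}->{s_next}: F(edge)={f:.3f}  F(s')*PB={F[s_next] * pb:.3f}  "
          f"PF(s'|s)={f / F[s]:.3f}")
```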
The practical goal of GFlowNets is to learn a forward policy $\PF$ that samples terminal states (objects) $x \in \mathcal{X}$ with probability proportional to R(x). This reward matching condition is equivalent to ensuring that the flow into the sink state from any terminal state x is proportional to R(x), i.e., $\mathcal{F}(x \to s_f) = R(x)$ (Proposition 3.6).
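Spelling out the implied sampling distribution under this scaling (since the edge $x \to s_f$ is traversed at most once per trajectory, its flow divided by $\mathcal{F}(s_f)$ is the termination probability at x):

$$
P(\text{trajectory ends at } x) \;=\; \frac{\mathcal{F}(x \to s_f)}{\mathcal{F}(s_f)} \;=\; \frac{R(x)}{\sum_{x' \in \mathcal{X}} R(x')}.
$$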
The paper discusses two main training scenarios:
- Fixed Backward Policy: If the backward policy $\PB$ is fixed (e.g., hand-designed to satisfy reward matching at $s_f$ and be near-uniform otherwise), the theoretical results guarantee that there is a unique forward policy $\PF$ that matches $\PB$ and satisfies reward matching, and that the induced trajectories have finite expected length (Corollary 3.7). In this setting, standard GFlowNet losses developed for acyclic graphs, such as the Detailed Balance (DB) loss [bengio2023gflownet], can be used to train the forward policy (and optionally the flows). The concept of "loss stability" introduced in [brunswic2024theory] to prevent flows from growing uncontrollably along cycles does not affect the final result here, since a unique solution with finite expected trajectory length exists. The main challenge is that a hand-picked fixed $\PB$ might result in a very large, albeit finite, expected trajectory length, making forward sampling inefficient.
- Trainable Backward Policy: Training $\PB$ allows the GFlowNet to potentially find a solution with a smaller expected trajectory length. However, standard losses like DB can lead to unstable training in which the expected trajectory length grows unboundedly if not addressed [brunswic2024theory]. The paper provides a key insight: the expected trajectory length equals the normalized total state flow $\frac{1}{\mathcal{F}(s_f)} \sum_{s \in \mathcal{S} \setminus \{s_0, s_f\}} \mathcal{F}(s)$ (Proposition 3.8); a Monte Carlo sketch of this identity on the toy graph from above follows this list. Therefore, minimizing expected trajectory length is equivalent to minimizing the total state flow.
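The sketch below estimates both sides of the length/flow identity by Monte Carlo on the toy graph used earlier. Length is counted here as the number of transitions and the sum runs over all states except $s_f$; the exact endpoint bookkeeping follows the paper's Proposition 3.8.

```python
import random

# Same hypothetical toy graph and uniform backward policy as above.
parents = {"sf": [1, 2], 1: [0, 2], 2: [0, 1], 0: []}

def sample_backward(s0=0, sf="sf"):
    traj = [sf]
    while traj[-1] != s0:
        traj.append(random.choice(parents[traj[-1]]))
    return traj

n, total_transitions, visits = 100_000, 0, {}
for _ in range(n):
    traj = sample_backward()
    total_transitions += len(traj) - 1          # number of transitions
    for s in traj:
        visits[s] = visits.get(s, 0) + 1

F = {s: c / n for s, c in visits.items()}       # flows with F(sf) = 1
expected_length = total_transitions / n
total_state_flow = sum(f for s, f in F.items() if s != "sf")

# Every visited state except the final sf emits exactly one transition, so the
# expected number of transitions equals this normalized total state flow:
# shrinking the total state flow shrinks the expected trajectory length.
print(f"E[length] = {expected_length:.3f}, total state flow = {total_state_flow:.3f}")
```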
Based on this equivalence, the paper proposes adding a state flow regularization term, $\lambda \mathcal{F}(s)$, to the standard loss function, such as the DB loss (Equation 3.10). This encourages the model to find a solution that minimizes the total flow while satisfying the detailed balance conditions and reward matching.
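A minimal sketch of what such a regularized objective could look like per sampled transition $s \to s'$, assuming log-flow and log-policy outputs from some parameterization (names and the exact per-transition form are illustrative; the paper's precise objective is Equation 3.10):

```python
import torch

def regularized_db_loss(log_F_s, log_F_next, log_PF, log_PB, lam=1e-3):
    """Sketch of a detailed-balance loss with state flow regularization.

    Args (per sampled transition s -> s'):
      log_F_s, log_F_next : log-flows log F_theta(s), log F_theta(s')
      log_PF              : log PF_theta(s' | s)
      log_PB              : log PB_theta(s | s')
      lam                 : regularization strength (lambda)
    """
    # Standard DB residual in the Delta-log-F scale:
    # log F(s) + log PF(s'|s) - log F(s') - log PB(s|s').
    residual = log_F_s + log_PF - log_F_next - log_PB
    db_loss = residual.pow(2)
    # State flow regularizer lambda * F(s): penalizing total state flow is
    # equivalent to penalizing expected trajectory length (Prop. 3.8).
    reg = lam * log_F_s.exp()
    return (db_loss + reg).mean()
```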
The paper also explores the "Scaling Hypothesis": the practical stability issues might stem from the scale at which flow errors are computed. The standard DB loss operates on log-flows ($\Delta\log\mathcal{F}$), while the stable DB (SDB) loss from [brunswic2024theory] and other flow-based losses operate on flow differences ($\Delta\mathcal{F}$). The hypothesis is that $\Delta\mathcal{F}$-scale losses are biased towards smaller flows because their gradients diminish rapidly when flows are small, making it harder to increase them. This intrinsic bias might provide a form of stability.
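One way to see the intuition: write the flow as $\mathcal{F}_\theta = e^{u}$ and compare the gradient of a squared error in the two scales with respect to the log-flow parameter $u$, for a target flow $T > 0$:

$$
\underbrace{\frac{\partial}{\partial u}\bigl(e^{u} - T\bigr)^{2} = 2\,(e^{u} - T)\,e^{u}}_{\Delta\mathcal{F}\ \text{scale}}
\qquad\text{vs.}\qquad
\underbrace{\frac{\partial}{\partial u}\bigl(u - \log T\bigr)^{2} = 2\,(u - \log T)}_{\Delta\log\mathcal{F}\ \text{scale}}.
$$

When the current flow $e^{u}$ is small, the $\Delta\mathcal{F}$-scale gradient is suppressed by the extra factor $e^{u}$, so raising small flows is slow, which biases such losses toward small flows and hence toward shorter trajectories.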
From an implementation perspective, GFlowNets in this setting are typically parameterized by neural networks predicting log-flows ($\log\mathcal{F}_\theta(s)$), forward policy logits, and backward policy logits. Training involves optimizing these parameters to satisfy the detailed balance equations and reward matching, potentially with the proposed state flow regularization. The training objective involves sampling transitions or trajectories and applying the loss, similar to reinforcement learning training procedures.
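A hypothetical parameterization along these lines, with a generic shared trunk and three heads (this is a sketch of the typical setup described above, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class NonAcyclicGFlowNet(nn.Module):
    """Sketch: shared trunk with heads for log F_theta(s), forward-policy
    logits, and backward-policy logits."""

    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.log_flow = nn.Linear(hidden, 1)           # log F_theta(s)
        self.pf_logits = nn.Linear(hidden, n_actions)  # forward policy
        self.pb_logits = nn.Linear(hidden, n_actions)  # backward policy

    def forward(self, states):
        h = self.trunk(states)
        return (self.log_flow(h).squeeze(-1),
                torch.log_softmax(self.pf_logits(h), dim=-1),
                torch.log_softmax(self.pb_logits(h), dim=-1))
```

The log-probabilities returned here are exactly the quantities consumed by a DB-style loss such as the regularized sketch above (with invalid actions masked out before the softmax in practice).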
The paper generalizes the equivalence between GFlowNet training (specifically with a fixed backward policy and reward matching) and entropy-regularized reinforcement learning [tiapkin2024generative] to the non-acyclic setting (Theorem 3.11). This connection suggests that algorithms and insights from entropy-regularized RL could be beneficial for training non-acyclic GFlowNets.
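Roughly, following [tiapkin2024generative], the correspondence can be sketched as follows; the construction below is the acyclic form from that work, shown only as an illustration (the precise non-acyclic statement is Theorem 3.11):

$$
r(s \to s') \;=\;
\begin{cases}
\log \PB(s \mid s'), & s' \neq s_f,\\
\log R(s), & s' = s_f,
\end{cases}
$$

i.e., with these rewards and unit entropy-regularization coefficient, the soft-optimal policy of the resulting MDP coincides with the reward-matching forward policy $\PF$ for the given fixed $\PB$.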
Experimental results on non-acyclic hypergrid and permutation environments validate the theoretical findings:
- Training with a fixed $\PB$ works with standard (unstable in the sense of [brunswic2024theory]) losses, confirming Corollary 3.7.
- When $\PB$ is trainable, standard losses in the $\Delta\log\mathcal{F}$ scale can lead to unbounded expected trajectory lengths without regularization, especially in larger environments (Figure 3 in the Appendix).
- Losses in the $\Delta\mathcal{F}$ scale (like the SDB variant) tend to yield smaller expected trajectory lengths, supporting the scaling hypothesis, but may struggle to accurately match the reward distribution on more complex tasks.
- Adding state flow regularization to $\Delta\log\mathcal{F}$ losses (both standard DB and SDB in log scale) successfully controls the expected trajectory length while achieving better reward distribution matching than $\Delta\mathcal{F}$ losses (Figure 2, Table 1). The regularization strength allows tuning the trade-off between low expected length and high sampling accuracy.
In conclusion, the paper provides a clear theoretical foundation for discrete non-acyclic GFlowNets, highlights the critical role of the backward policy and expected number of visits in defining flows, and demonstrates that controlling the total state flow via regularization is a practical way to manage expected trajectory length during training with a learnable backward policy, even when using standard "unstable" losses. The empirical evidence suggests that $\Delta\log\mathcal{F}$ losses with regularization are a promising approach for accurate sampling with controlled trajectory length in non-acyclic environments. Future work could explore alternative losses, leverage RL techniques more deeply, and address specific non-acyclic environment structures.