GFP: Value-Aware Flow Policy in Offline RL
- Guided Flow Policy (GFP) is a value-aware offline RL method that leverages flow-based generative modeling to prioritize high-value actions in the learning process.
- It combines a multi-step Value-aware Behavior Cloning (VaBC) flow with a distilled one-step actor to efficiently imitate rewarding actions from offline datasets.
- Empirical evaluations demonstrate that GFP outperforms traditional methods on state and pixel benchmarks, particularly in noisy or suboptimal environments.
A Guided Flow Policy (GFP) is a behavior-regularized offline reinforcement learning (RL) algorithm that incorporates value information directly into the policy regularizer via flow-based generative modeling. GFP addresses a limitation of traditional behavior-regularized actor–critic (BRAC) frameworks, which regularize towards all dataset actions indiscriminately, regardless of their value, by guiding the learned policy distribution towards high-value actions. The approach couples two components, a multi-step flow-matching policy (the Value-aware Behavior Cloning, or VaBC, flow) and a distilled one-step actor, to achieve efficient, value-aware imitation from offline datasets and to constrain policy updates to the high-reward support of the data. GFP matches or exceeds state-of-the-art baselines across a large suite of state- and pixel-based offline RL benchmarks and is designed for expressive policy learning, dataset alignment, and sample efficiency (Tiofack et al., 3 Dec 2025, Alles et al., 20 May 2025).
1. Motivation and Conceptual Foundations
Offline RL algorithms seek to learn policies from fixed datasets, necessitating explicit mechanisms to avoid value overestimation on out-of-distribution actions and to stay close to the support of the available data. Classic BRAC methods add a behavior-cloning penalty term to the actor loss:

$$\max_{\theta}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\, Q_{\phi}\big(s, \pi_{\theta}(s)\big) \;-\; \alpha\, \big\| \pi_{\theta}(s) - a \big\|^{2} \,\Big],$$

where $\alpha$ trades off value maximization and proximity to the dataset. However, this penalty does not distinguish between good and poor actions and can cause policies to gravitate towards suboptimal behaviors present in the data. GFP introduces value-awareness into the regularization itself: the policy is guided by a flow-modulated distribution that upweights high-$Q$ actions, and the behavioral actor is regularized to remain close to this flow.
This mutual guidance ensures that (i) the flow policy preferentially clones high-value actions, and (ii) the distilled actor maximizes expected return while being restricted to regions of the action space with demonstrated performance in the dataset (Tiofack et al., 3 Dec 2025).
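To make the contrast concrete, the sketch below compares a uniform BRAC-style penalty with a critic-weighted one. The module interfaces (`actor(states)` returning actions, `critic(states, actions)` returning Q-values) and the softmax weighting are illustrative assumptions; GFP itself applies value-aware weighting to the flow policy rather than directly to the actor penalty, as detailed in the following sections.

```python
import torch
import torch.nn.functional as F

def brac_actor_loss(actor, critic, states, dataset_actions, alpha=1.0):
    """Classic BRAC-style actor loss: maximize Q while penalizing distance
    to every dataset action uniformly, regardless of its value."""
    policy_actions = actor(states)
    q_term = -critic(states, policy_actions).mean()
    bc_term = F.mse_loss(policy_actions, dataset_actions)
    return q_term + alpha * bc_term

def value_weighted_actor_loss(actor, critic, states, dataset_actions,
                              alpha=1.0, tau=1.0):
    """Value-aware variant in the spirit of GFP: the cloning penalty is
    reweighted by softmax(Q / tau), so high-value dataset actions dominate
    the regularizer (an illustrative sketch, not the exact GFP objective)."""
    policy_actions = actor(states)
    q_term = -critic(states, policy_actions).mean()
    with torch.no_grad():
        q_data = critic(states, dataset_actions).squeeze(-1)
        weights = torch.softmax(q_data / tau, dim=0)            # sums to 1
    per_sample_bc = ((policy_actions - dataset_actions) ** 2).sum(dim=-1)
    bc_term = (weights * per_sample_bc).sum()
    return q_term + alpha * bc_term
```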
2. Architecture and Mechanistic Overview
GFP consists of three core components, trained in a mutually guiding loop:
- Value-aware multi-step flow-matching policy (VaBC): This component models the dataset action distribution as a time-dependent flow parameterized by a velocity field $v_{\psi}(a_t, t \mid s)$ such that the solution of the ODE

  $$\frac{d a_t}{d t} = v_{\psi}(a_t, t \mid s), \qquad a_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1],$$

  produces $a_1 \sim \pi_{\psi}(\cdot \mid s)$ (a sampling sketch follows this list). The flow is trained via a weighted behavior-cloning objective in which the weights depend on the critic, biasing the modeled distribution towards high-return transitions (Tiofack et al., 3 Dec 2025).
- Distilled one-step actor $\pi_{\theta}$: Direct backpropagation through the flow's ODE is computationally prohibitive. To mitigate this, GFP distills the flow's behavior into a parameter-efficient, single-step policy, typically parameterized as an implicit mapping $a = \pi_{\theta}(s, z)$ where $z \sim \mathcal{N}(0, I)$. The actor is trained to (i) maximize return under the current critic, and (ii) remain close to the VaBC flow distribution, enabling fast inference and stable training (Tiofack et al., 3 Dec 2025).
- Critic $Q_{\phi}$: Evaluates state-action pairs and provides the ranking signal used to upweight high-value actions and downweight low-value ones during flow-guided learning. The critic is updated using Bellman-consistent losses with targets that can incorporate both actor and flow-policy samples (Tiofack et al., 3 Dec 2025).
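As a concrete illustration of the multi-step flow, the following sketch samples an action by Euler-integrating the learned ODE; the `velocity_field(a, t, states)` signature and the fixed step count are assumptions rather than the paper's exact implementation.

```python
import torch

@torch.no_grad()
def sample_flow_action(velocity_field, states, action_dim, n_steps=10, noise=None):
    """Draw actions from the multi-step flow policy by Euler-integrating
    da/dt = v_psi(a_t, t | s) from t = 0 to t = 1, starting at a_0 ~ N(0, I)."""
    a = noise if noise is not None else torch.randn(states.shape[0], action_dim)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((states.shape[0], 1), k * dt)
        a = a + dt * velocity_field(a, t, states)   # forward Euler step
    return a                                        # approximately a_1 ~ pi_psi(. | s)
```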
The training alternates through three phases per minibatch (a schematic update loop is sketched after this list):
- Critic update using Bellman targets.
- Actor update that maximizes Q while distilling the flow policy.
- VaBC/flow update via weighted flow-matching, using a guidance function that biases the flow towards dataset actions with high $Q_{\phi}$ values.
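A schematic PyTorch-style update loop is sketched below. The module names, their signatures, and the target-critic bootstrapping are assumptions made for illustration; `actor_distillation_loss` and `vabc_flow_matching_loss` are hypothetical helpers corresponding to the objectives sketched in Section 3.

```python
import torch

def gfp_minibatch_update(batch, critic, target_critic, actor, flow_velocity,
                         critic_opt, actor_opt, flow_opt, gamma=0.99):
    """One GFP training iteration: critic, then actor, then VaBC flow."""
    # r and done are assumed to have shape (batch, 1), matching the critic output.
    s, a, r, s_next, done = batch

    # (1) Critic update: regress onto a Bellman target built from the one-step actor.
    with torch.no_grad():
        a_next = actor(s_next, torch.randn_like(a))
        target = r + gamma * (1.0 - done) * target_critic(s_next, a_next)
    critic_loss = ((critic(s, a) - target) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # (2) Actor update: maximize Q while distilling the VaBC flow.
    actor_loss = actor_distillation_loss(actor, critic, flow_velocity, s,
                                         action_dim=a.shape[-1])
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # (3) VaBC/flow update: critic-guided flow matching on dataset actions.
    flow_loss = vabc_flow_matching_loss(flow_velocity, critic, actor, s, a)
    flow_opt.zero_grad()
    flow_loss.backward()
    flow_opt.step()
```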
3. Flow-Matching Losses and Mutual Guidance Mechanisms
The flow loss is constructed using a guidance function that interpolates the flow's cloning target between the dataset action $a$ and the current actor's proposal $\pi_{\theta}(s)$, according to the critic's assessment of $a$:

$$\hat{a} \;=\; g_{\tau}(s, a)\, a \;+\; \big(1 - g_{\tau}(s, a)\big)\, \pi_{\theta}(s),$$

where $g_{\tau}(s, a) \in [0, 1]$ increases with $Q_{\phi}(s, a)$ under a guidance temperature $\tau$. The VaBC flow-matching loss is then given by:

$$\mathcal{L}_{\mathrm{VaBC}}(\psi) \;=\; \mathbb{E}_{(s,a)\sim\mathcal{D},\; t\sim\mathcal{U}[0,1],\; a_0\sim\mathcal{N}(0,I)}\Big[\, \big\| v_{\psi}(a_t, t \mid s) - (\hat{a} - a_0) \big\|^{2} \,\Big],$$

where $a_t = (1 - t)\, a_0 + t\, \hat{a}$. This loss focuses learning on transitions with high $Q_{\phi}$ and prevents mode collapse during early training, when the critic may still be inaccurate. The guidance temperature $\tau$ controls the selectivity–diversity trade-off; very low $\tau$ may over-concentrate on high-$Q$ actions and harm learning stability (Tiofack et al., 3 Dec 2025).
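A minimal PyTorch sketch of this loss is given below, reusing the velocity-field and actor interfaces assumed in the earlier sketches; the sigmoid form of the guidance function $g_{\tau}$ is an illustrative choice, not necessarily the paper's.

```python
import torch

def vabc_flow_matching_loss(flow_velocity, critic, actor, states, actions, tau=1.0):
    """Critic-guided flow-matching loss (VaBC sketch): the cloning target is
    an interpolation between the dataset action and the actor's proposal,
    weighted by how highly the critic ranks the dataset action."""
    a0 = torch.randn_like(actions)                      # noise sample a_0
    t = torch.rand(actions.shape[0], 1)                 # flow time t ~ U[0, 1]
    with torch.no_grad():
        g = torch.sigmoid(critic(states, actions).squeeze(-1) / tau).unsqueeze(-1)
        proposal = actor(states, torch.randn_like(actions))
        a_hat = g * actions + (1.0 - g) * proposal      # guided target
    a_t = (1.0 - t) * a0 + t * a_hat                    # linear interpolant
    v_target = a_hat - a0                               # straight-line velocity
    v_pred = flow_velocity(a_t, t, states)
    return ((v_pred - v_target) ** 2).sum(dim=-1).mean()
```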
The actor update solves:

$$\max_{\theta}\;\; \mathbb{E}_{s\sim\mathcal{D},\; z\sim\mathcal{N}(0,I)}\Big[\, Q_{\phi}\big(s, \pi_{\theta}(s, z)\big) \,\Big] \;-\; \alpha\, D\big(\pi_{\theta}, \pi_{\psi}\big),$$

where $D$ is a distillation distance between the one-step actor and the VaBC flow (for example, a squared error between actions generated from shared noise) and $\alpha$ is the BC regularization strength, ensuring the distilled actor seeks actions that are both valuable and lie close to the flow distribution.
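Under the same assumed interfaces, a sketch of the distilled actor's objective is shown below; pairing the actor and the flow on shared noise and using a squared distillation distance are illustrative choices.

```python
import torch

def actor_distillation_loss(actor, critic, flow_velocity, states,
                            action_dim, alpha=1.0, n_flow_steps=10):
    """One-step actor objective: maximize Q under the current critic while
    staying close to the action produced by the multi-step VaBC flow."""
    z = torch.randn(states.shape[0], action_dim)
    a_actor = actor(states, z)

    # Distillation target: Euler-integrate the flow ODE starting from the same noise z.
    with torch.no_grad():
        a, dt = z.clone(), 1.0 / n_flow_steps
        for k in range(n_flow_steps):
            t = torch.full((states.shape[0], 1), k * dt)
            a = a + dt * flow_velocity(a, t, states)
        a_flow = a

    q_term = -critic(states, a_actor).mean()
    distill_term = ((a_actor - a_flow) ** 2).sum(dim=-1).mean()
    return q_term + alpha * distill_term
```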
4. Theoretical Properties and Convergence
The velocity field matching performed by the VaBC flow policy is grounded in flow-matching objectives that, under mild conditions, provably converge to the data distribution in function space. The guidance weights maintain stability by preventing the flow from collapsing onto a negligible subset of the data support when the critic is inaccurate early in training. GFP inherits local convergence guarantees from BRAC, enhanced by the expressiveness of flow-based policies. However, global convergence guarantees for the joint actor–flow–critic system remain open (Tiofack et al., 3 Dec 2025).
5. Empirical Evaluation and Comparative Results
GFP has been evaluated on OGBench, Minari, and D4RL, covering 144 tasks across state-based and pixel-based benchmarks:
| Benchmark | Tasks | GFP π_θ Score | Comparison Baseline | Baseline Score |
|---|---|---|---|---|
| OGBench (state) | 105 | 53.2 | FQL | 46.7 |
| Minari (state+pixel) | 21 | 74.1 | FQL | 65.9 |
| D4RL | 18 | 63.0 | ReBRAC | 64.8 |
| Suboptimal/noisy datasets | cube-double/triple-noisy | +25 to +40 over FQL | FQL | — |
GFP consistently matches or outperforms state-of-the-art baselines including IQL, ReBRAC, FQL, CQL, TD3+BC, and others. Notably, its advantage is pronounced on datasets featuring high suboptimality or additional noise, due to its value-aware mechanism for action selection (Tiofack et al., 3 Dec 2025).
Ablation studies identify an intermediate guidance temperature $\tau$ as optimal, balancing selectivity and action diversity; lower $\tau$ increases selectivity but can impair critic learning.
6. Algorithmic Efficiency, Hyperparameters, and Limitations
GFP’s multi-step ODE-based flow increases training time by approximately 20% relative to conventional BRAC; however, inference via the distilled actor is extremely efficient, requiring only a single forward pass. The primary hyperparameters are the BC regularization strength $\alpha$ and the guidance temperature $\tau$. Larger batch sizes (up to 1024) are favored for certain tasks; the discount factor $\gamma$ is environment-specific (Tiofack et al., 3 Dec 2025).
GFP’s performance depends on the accuracy of the critic $Q_{\phi}$; poor critic quality can lead to suboptimal guidance. The framework currently lacks global convergence proofs for the coupled actor–flow–critic updates. Extensions such as advantage-weighted guidance and richer policy-context conditioning remain promising avenues for further improvement.
7. Relationship to FlowQ and Alternative Flow-Based Methods
FlowQ (Alles et al., 20 May 2025) exemplifies an alternative energy-guided flow-matching paradigm in offline RL, in which the flow policy is trained to match a distribution proportional to the behavior policy reweighted by exponentiated Q-values:

$$\pi^{*}(a \mid s) \;\propto\; \mu(a \mid s)\, \exp\!\big(\beta\, Q(s, a)\big),$$

where $\mu$ denotes the behavior policy and $\beta$ an inverse-temperature coefficient. FlowQ leverages a Gaussian approximation and matches the ODE flow to this Q-guided target at every point, eliminating the need for additional guidance at inference. Its training time remains constant in the number of flow steps, in contrast to diffusion-based approaches, whose training cost scales linearly with the number of reverse-diffusion steps. Both GFP and FlowQ outperform or match existing diffusion- and flow-based policies across benchmarks, illustrating the general efficacy of flow matching with direct value guidance in offline RL (Alles et al., 20 May 2025).
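For intuition, the snippet below approximates FlowQ's target distribution by self-normalized reweighting of behavior-policy samples with $\exp(\beta Q)$; the function and its interfaces are illustrative assumptions, since FlowQ bakes this target into the flow-matching objective during training rather than reweighting samples at inference.

```python
import torch

def energy_guided_weights(critic, state, candidate_actions, beta=1.0):
    """Self-normalized weights w_i proportional to exp(beta * Q(s, a_i)) for
    candidate actions a_i drawn from the behavior policy at a single state s.
    The reweighted samples approximate pi*(a|s) ~ mu(a|s) * exp(beta * Q(s, a))."""
    n = candidate_actions.shape[0]
    states = state.expand(n, -1)                   # repeat s for each candidate action
    with torch.no_grad():
        q = critic(states, candidate_actions).squeeze(-1)
    return torch.softmax(beta * q, dim=0)          # weights sum to one
```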
References
- "Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning" (Tiofack et al., 3 Dec 2025)
- "FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning" (Alles et al., 20 May 2025)