Reparameterization Flow Policy Optimization
- RFO is a reinforcement learning method that trains continuous normalizing flow policies using reparameterization gradients, bypassing intractable likelihood evaluations.
- It employs Conditional Flow Matching regularization to ensure stability through past-data consistency and enhanced exploration via uniform-target adjustments.
- Empirical results show RFO achieves state-of-the-art performance in complex control tasks, outperforming prior baselines with improved sample efficiency and robustness.
Reparameterization Flow Policy Optimization (RFO) is a reinforcement learning (RL) methodology for training highly expressive, continuous normalizing flow (CNF) policies, leveraging the reparameterization (pathwise derivative) gradient to enable stable and sample-efficient learning. RFO unifies flow-based policy parameterizations with differentiable (often model-based) RL by allowing gradients to be backpropagated directly through the flow-generator ODEs and, if present, the environment dynamics, yielding policy optimization without intractable likelihood evaluations. The RFO framework, introduced in (Zhong et al., 3 Feb 2026), integrates regularization terms for stability and exploration, and in practical instantiations, supports both on-policy and off-policy algorithm designs. RFO has demonstrated state-of-the-art performance in diverse continuous control and manipulation tasks.
1. Foundations: Reparameterization Policy Gradients and Flow Policies
The reparameterization policy gradient (RPG) framework in model-based RL exploits differentiable simulators to compute control gradients via the pathwise (reparameterization) method rather than classic likelihood-ratio approaches. The policy is represented as a deterministic, differentiable map of injected noise,
$$a_t = \pi_\theta(s_t, \epsilon_t), \qquad \epsilon_t \sim p(\epsilon),$$
with expected return objective
$$J(\theta) = \mathbb{E}_{\epsilon_{0:T}}\!\left[\sum_{t=0}^{T} \gamma^t\, r(s_t, a_t)\right],$$
whose gradient is obtained by backpropagating through the differentiable dynamics $s_{t+1} = f(s_t, a_t)$.
Unlike standard Gaussian policies, flow-based policies use a state-conditioned CNF to transform base noise $z \sim \mathcal{N}(0, I)$ to actions via the ODE
$$\frac{d a_\tau}{d\tau} = v_\theta(a_\tau, \tau; s), \qquad a_0 = z,$$
with the final action $a = a_1$. This framework allows flexible, potentially multimodal action sampling, but naively applying RPG to flow policies leads to instability and poor exploration, necessitating specialized regularization schemes (Zhong et al., 3 Feb 2026).
2. Flow Policy ODE Parameterizations and Training Without Log-Likelihoods
A flow policy parameterizes the stochastic mapping from base noise to agent actions as an invertible, differentiable, time-dependent ODE flow $a_1 = \Phi_\theta(z; s)$, obtained by integrating the velocity field $v_\theta$ from $\tau = 0$ to $\tau = 1$. The action-conditional density is formally given by the change-of-variables formula,
$$\log \pi_\theta(a_1 \mid s) = \log p(a_0) - \int_0^1 \nabla \cdot v_\theta(a_\tau, \tau; s)\, d\tau,$$
but RFO sidesteps computing the inverse flow and the associated divergence (Jacobian determinant) terms: it optimizes solely by direct backpropagation through the ODE solver and, as applicable, the system dynamics, avoiding any log-likelihood evaluations (Zhong et al., 3 Feb 2026).
Practical implementations adopt numerical integration (Euler or higher-order methods, typically a small fixed number of integration steps) and rely on deep neural network approximations of the velocity field $v_\theta$.
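As a concrete illustration, fixed-step Euler integration of the flow ODE can be sketched as follows. This is a minimal numpy sketch: the toy `velocity` field and the step count `K` are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def velocity(a, tau, s):
    # Toy stand-in for the learned velocity field v_theta(a, tau; s):
    # drifts the sample toward a (hypothetical) state-dependent target.
    target = np.tanh(s)
    return target - a

def sample_action(s, K=10, rng=None):
    """Draw an action by Euler-integrating the flow ODE from base noise."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(s.shape)      # a_0 = z ~ N(0, I)
    dt = 1.0 / K
    for k in range(K):
        tau = k * dt
        a = a + dt * velocity(a, tau, s)  # Euler step: a += dt * v(a, tau; s)
    return a                              # a_1 is the sampled action

s = np.array([0.5, -1.0])
a = sample_action(s)
print(a.shape)  # (2,)
```

Because every Euler step is a differentiable function of the velocity field's parameters, gradients can flow through the entire integration, which is exactly the property RFO exploits.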
3. Stability and Exploration: Conditional Flow Matching Regularization
Training instability arises because policy updates can deform the invertible flow ODE so that the trajectories leading to previously visited actions are destroyed, leaving the policy unable to re-sample those actions. RFO introduces two Conditional Flow Matching (CFM) regularization losses:
- Past-Data CFM Regularization (Stability):
$$\mathcal{L}_{\text{past}}(\theta) = \mathbb{E}_{(s, a_1) \sim \mathcal{D},\ a_0 \sim \mathcal{N}(0, I),\ \tau \sim \mathcal{U}[0,1]} \left\| v_\theta\big((1-\tau)\, a_0 + \tau\, a_1,\ \tau;\ s\big) - (a_1 - a_0) \right\|^2,$$
where $\mathcal{D}$ is a buffer of recently visited state–action pairs. This term keeps the vector field consistent with recently visited action trajectories.
- Uniform-Target CFM Regularization (Exploration):
$$\mathcal{L}_{\text{unif}}(\theta) = \mathbb{E}_{s,\ a_1 \sim \mathcal{U}(\mathcal{A}),\ a_0 \sim \mathcal{N}(0, I),\ \tau \sim \mathcal{U}[0,1]} \left\| v_\theta\big((1-\tau)\, a_0 + \tau\, a_1,\ \tau;\ s\big) - (a_1 - a_0) \right\|^2,$$
with targets $a_1$ drawn uniformly over the action space $\mathcal{A}$. This term encourages the policy to retain coverage of the entire action space, improving exploration.
The overall loss combines the (negative) short-horizon return with the above regularizers, weighted by coefficients. Empirical ablations show that both terms are essential for robust optimization; omitting either degrades performance to the level of earlier RPG baselines (Zhong et al., 3 Feb 2026).
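Both regularizers share the same linear-interpolant flow-matching form and differ only in where the target actions $a_1$ come from. A hedged numpy sketch (the placeholder velocity field, batch shapes, and $[-1, 1]$ action box are illustrative assumptions):

```python
import numpy as np

def cfm_loss(velocity, s, a1, rng):
    """Monte Carlo estimate of the CFM objective
    E || v(a_tau, tau; s) - (a1 - a0) ||^2
    with the linear interpolant a_tau = (1 - tau) * a0 + tau * a1."""
    a0 = rng.standard_normal(a1.shape)         # base noise a_0 ~ N(0, I)
    tau = rng.uniform(size=(a1.shape[0], 1))   # tau ~ U[0, 1], per sample
    a_tau = (1 - tau) * a0 + tau * a1
    target = a1 - a0                           # flow-matching regression target
    err = velocity(a_tau, tau, s) - target
    return float(np.mean(np.sum(err ** 2, axis=-1)))

rng = np.random.default_rng(0)
s = rng.standard_normal((32, 4))               # batch of states
velocity = lambda a, tau, s: np.zeros_like(a)  # untrained placeholder field

# Past-data term: targets are recently executed actions from a buffer.
a_past = rng.standard_normal((32, 2))
loss_past = cfm_loss(velocity, s, a_past, rng)

# Uniform-target term: targets drawn uniformly over a [-1, 1]^2 action box.
a_unif = rng.uniform(-1.0, 1.0, size=(32, 2))
loss_unif = cfm_loss(velocity, s, a_unif, rng)
print(loss_past > 0 and loss_unif > 0)  # True
```

Driving either loss to zero makes the flow reproduce the corresponding target distribution; in RFO they act as soft constraints alongside the return gradient rather than as the sole training signal.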
4. Algorithmic Structure and Action Chunking Variant
The canonical RFO training loop follows the Short-Horizon Actor-Critic (SHAC) style:
- Collect short-horizon rollout trajectories under the current policy.
- Augment recent-action and rollout buffers for use in CFM regularization.
- Compute the reparameterized gradient of the short-horizon return proxy by BPTT through both the flow and simulator.
- Apply the past-data and uniform-target CFM regularization gradients.
- Update policy parameters $\theta$ and critic parameters $\phi$.
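The loop above can be sketched schematically as follows. This is a toy skeleton with stubbed dynamics, reward, and a finite-difference gradient standing in for BPTT; the function names (`rollout`, `short_horizon_return`) and all numeric choices are illustrative assumptions, not the paper's API.

```python
import numpy as np

theta = np.full(4, 0.1)   # toy stand-in for flow-policy parameters
buffer = []               # recent (state, action) pairs for the CFM terms

def rollout(theta, horizon=8):
    """Stub short-horizon rollout through toy differentiable dynamics."""
    s = np.full(4, 0.5)
    traj = []
    for _ in range(horizon):
        a = np.tanh(theta * s)        # stand-in for sampling from the flow
        r = -np.sum((s - a) ** 2)     # toy reward: actions should track state
        traj.append((s, a, r))
        s = 0.9 * s + 0.1 * a         # toy differentiable transition
    return traj

def short_horizon_return(theta):
    return sum(r for _, _, r in rollout(theta))

def grad(f, x, eps=1e-4):
    """Finite differences standing in for BPTT through flow + simulator."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

init_ret = short_horizon_return(theta)
for _ in range(20):
    traj = rollout(theta)                        # 1. collect short rollout
    buffer.extend((s, a) for s, a, _ in traj)    # 2. fill CFM buffers
    theta = theta + 1e-2 * grad(short_horizon_return, theta)  # 3. ascend return
    # 4-5. CFM regularizer gradients and the critic update would go here.

print(short_horizon_return(theta) > init_ret)  # True
```

The essential structural point is step 3: the return is differentiated directly with respect to the policy parameters through the rollout, with no likelihood-ratio estimator anywhere in the loop.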
The action-chunking variant extends the flow policy to emit an action sequence of length $H$ at each step, which is executed in order before the next observation. Optimization is carried out on blocked segments, with the flow ODE and its regularizers extended to the higher-dimensional chunked action space (Zhong et al., 3 Feb 2026).
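Extending the sampler to chunks only changes the shape of the integrated variable. A minimal sketch, assuming a chunk length `H`, per-step action dimension `d_a`, and a toy velocity field in place of the learned network:

```python
import numpy as np

def sample_chunk(s, H=4, d_a=2, K=10, rng=None):
    """Euler-integrate the flow ODE over a stacked (H, d_a) action chunk."""
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal((H, d_a))   # base noise for the whole chunk
    dt = 1.0 / K
    for k in range(K):
        # Toy velocity field broadcast over the chunk; the real v_theta is a
        # neural network conditioned on (a, tau, s).
        a = a + dt * (np.tanh(s[:d_a]) - a)
    return a                            # H actions, executed in order

chunk = sample_chunk(np.array([0.5, -1.0, 0.3]))
print(chunk.shape)  # (4, 2)
```

The CFM regularizers carry over unchanged, with targets and interpolants living in the flattened $H \times d_a$ chunk space.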
5. Empirical Results and Comparative Evaluation
RFO has been validated on a wide spectrum of RL benchmarks with differentiable physics (DFlex, Rewarped):
- Locomotion: Ant, ANYmal, Soft Jumper (visual).
- Manipulation: Hand Reorient, Rolling Pin (pixel/state), Hand Flip, Transport.
The benchmarks include both state and pixel-based tasks and both rigid and soft-body dynamics.
In all reported environments, RFO achieves competitive or strictly superior final returns relative to SHAC, SAPO, FlowRL, DrAC, and other diffusion- and flow-matching-based baselines. Notably, on the high-dimensional Soft Jumper task, RFO attains a substantially higher return than the best previous method. The following table presents mean normalized performance (SHAC = 1.0):
| Task | SHAC | SAPO | RFO (ours) |
|---|---|---|---|
| Soft Jumper | 1.00 | 1.39 | 2.63 |
| Ant | 1.00 | 1.55 | 1.81 |
| Hand Reorient | 1.00 | 1.22 | 1.48 |
| ANYmal | 1.00 | 0.97 | 1.07 |
| Transport | 1.00 | 1.05 | 1.87 |
| Rolling Pin | 1.00 | 1.03 | 1.06 |
| Hand Flip | 1.00 | 1.27 | 1.26 |
Additionally, ablation studies confirm the necessity of both CFM regularization components for robust, stable learning (Zhong et al., 3 Feb 2026).
6. Theoretical Insights and Limitations
RFO uses the pathwise derivative throughout the combined flow and simulator graph, which yields low-variance gradient estimates characteristic of reparameterization but without requiring log-likelihood or density estimation. Unlike surrogate likelihood-ratio methods, RFO relies exclusively on differentiability of both the flow policy and environment dynamics.
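Concretely, writing the flow as a deterministic map $a = \Phi_\theta(z; s)$ of base noise $z \sim p(z)$, the pathwise gradient of the expected return takes the standard reparameterized form
$$\nabla_\theta\, \mathbb{E}_{z \sim p(z)}\!\big[ R\big(s, \Phi_\theta(z; s)\big) \big] = \mathbb{E}_{z \sim p(z)}\!\Big[ \nabla_a R(s, a)\big|_{a = \Phi_\theta(z; s)}\ \partial_\theta \Phi_\theta(z; s) \Big],$$
so only differentiability of $R$ (through the simulator) and of $\Phi_\theta$ (through the ODE solver) is required, and the density $\pi_\theta(a \mid s)$ never appears.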
The regularization terms are formulated as flow-matching objectives that guarantee stability (by maintaining reachability of previously visited actions) and promote global exploration (by injecting probability mass throughout the action space). RFO does not require explicit normalization constants or Jacobian determinants.
The method presupposes access to a fully differentiable simulator or a learned differentiable transition model. Backpropagation through long horizons may be computationally expensive but is partially mitigated via short-horizon proxy objectives. Action-chunked variants are supported but present greater optimization difficulty (Zhong et al., 3 Feb 2026).
7. Connections, Related Work, and Future Directions
RFO generalizes prior reparameterization-based policy gradient algorithms by supporting expressive, multimodal CNF policies. Unlike PolicyFlow (Yang et al., 1 Feb 2026), which targets on-policy PPO via a reparameterized importance-ratio surrogate combined with a Brownian-motivated entropy regularizer, RFO is fundamentally pathwise and sidesteps likelihood-ratio computations and their typical numerical instability. Off-policy RFO instantiations, as in SAC Flow (Zhang et al., 30 Sep 2025), further exploit velocity reparameterizations (Flow-G, Flow-T) to overcome gradient pathologies arising from the underlying residual RNN structure of flow rollouts.
Research directions highlighted in the literature include offline-to-online pretraining for flow policies, hybridizing with diffusion-based policies, and integrating with non-ODE generative models for richer policy classes (Zhong et al., 3 Feb 2026, Zhang et al., 30 Sep 2025).
RFO establishes a principled and empirically validated framework for policy optimization with high-capacity flows, providing stable, efficient, and expressive solutions for contemporary RL control challenges.