Truncated Rectified Flow Policy for Reinforcement Learning with One-Step Sampling

Published 10 Apr 2026 in cs.LG | (2604.09159v1)

Abstract: Maximum entropy reinforcement learning (MaxEnt RL) has become a standard framework for sequential decision making, yet its standard Gaussian policy parameterization is inherently unimodal, limiting its ability to model complex multimodal action distributions. This limitation has motivated increasing interest in generative policies based on diffusion and flow matching as more expressive alternatives. However, incorporating such policies into MaxEnt RL is challenging for two main reasons: the likelihood and entropy of continuous-time generative policies are generally intractable, and multi-step sampling introduces both long-horizon backpropagation instability and substantial inference latency. To address these challenges, we propose Truncated Rectified Flow Policy (TRFP), a framework built on a hybrid deterministic-stochastic architecture. This design makes entropy-regularized optimization tractable while supporting stable training and effective one-step sampling through gradient truncation and flow straightening. Empirical results on a toy multigoal environment and 10 MuJoCo benchmarks show that TRFP captures multimodal behavior effectively, outperforms strong baselines on most benchmarks under standard sampling, and remains highly competitive under one-step sampling.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces a hybrid deterministic-stochastic policy (TRFP) that overcomes the unimodal limitations in MaxEnt RL through one-step sampling.
It employs flow straightening regularization and a decoupled architecture to mitigate gradient instability and reduce inference latency.
Empirical results validate TRFP's ability to maintain multimodal coverage and competitive returns on standard benchmarks.

Truncated Rectified Flow Policies for Multimodal Reinforcement Learning with Efficient One-Step Sampling

Introduction and Motivation

This paper addresses an intrinsic representational bottleneck in maximum entropy reinforcement learning (MaxEnt RL): the reliance on unimodal Gaussian policy parameterizations. Although MaxEnt RL, exemplified by SAC, enhances exploration and robustness via entropy regularization, standard Gaussian policies cannot model complex multimodal action distributions required for many real-world and multi-goal tasks. The paper systematically critiques this limitation and surveys recent literature on generative policy parameterizations—especially diffusion probabilistic models (DPMs) and continuous normalizing flows (CNFs)—which offer enhanced expressivity at the cost of computational tractability and inference complexity. The two dominant challenges in prior diffusion/flow-based policies are (1) intractable action likelihoods impeding stable entropy maximization and (2) multi-step iterative generation causing both backpropagation instability and prohibitive inference latencies.

The paper introduces the Truncated Rectified Flow Policy (TRFP), a hybrid deterministic-stochastic policy architecture that enables tractable surrogate entropy-regularized RL, resolves the inference bottleneck via one-step sampling, and yields competitive or superior empirical performance on multimodal and standard RL benchmarks.

Methodology

Hybrid Policy Architecture

TRFP divides the policy generation trajectory into a deterministic prefix and a stochastic tail. The prefix segment (ODE-driven) transports a latent variable from an isotropic Gaussian prior toward the target action distribution, exploiting the straight-line geometric bias induced by rectified flow (RF) training. The tail segment (SDE-driven) introduces stochasticity, enabling a tractable surrogate likelihood computation over the augmented trajectory, and facilitating sample diversity for robust exploration.

This architectural decoupling supports adaptive trade-offs between expressivity and inference efficiency. During training, the stochastic tail enables Q-weighted actor updates and entropy estimation via a tractable surrogate. During inference (and in evaluation-ablation modes), the stochastic tail can be omitted, using only the deterministic transport, drastically reducing sampling cost.

Surrogate Entropy Optimization and Truncated Gradients

In the MaxEnt RL framework, stable entropy regularization depends on evaluating the log-likelihood of the policy, which is generally intractable for ODE-based flows due to Jacobian divergence integration. TRFP circumvents this with a surrogate likelihood, which only accounts for the stochastic tail, omitting the (deterministically volume-preserving) transport prefix. The error in this surrogate is upper-bounded by the trajectory divergence, which is minimized by the flow straightening regularizer.

Gradient instability—especially with deep, multi-step generation chains—is handled via truncated optimization. Gradients are not backpropagated through the prefix transport; the RL updates only affect the stochastic tail. This ensures training stability even with longer rollout horizons.

Flow Straightening Regularization

Flow straightening is enforced via self-distillation, where deterministic (zero-noise) rollouts generate trajectory endpoints that serve as targets for the ODE vector field. By regressing the flow field to a constant vector (source to target), the deterministic prefix approximates an optimal transport map, minimizing both geometric curvature and the entropy surrogate error. This in turn justifies efficient low-NFE (number of function evaluations) or even one-step rollout at deployment, which is critical for real-time control scenarios.

Figure 1: TRFP exhibits uniform multimodal coverage in the multi-goal setting, where baseline Gaussian policies fail to explore symmetrically.

Empirical Results

Multimodality and Policy Capacity

TRFP’s efficacy in representing multimodal policies is validated in a symmetric multi-goal toy environment. Unlike SAC and TD3, which collapse to a subset of goals, TRFP maintains balanced coverage, directly evidencing enhanced expressive capacity (Figure 1).

Benchmark Comparisons and Inference Efficiency

On 10 standard MuJoCo-v5 benchmarks, TRFP is compared against SAC, TD3, and recent diffusion policy baselines (SDAC, MaxEntDP). Under standard sampling (K=4), TRFP matches or outperforms all baselines in cumulative returns. When compressed to one-step inference (K=1), TRFP’s performance remains highly competitive, with minimal degradation, greatly outperforming other generative models at equivalent inference budgets.

Figure 2: TRFP achieves top returns under both standard and one-step sampling regimes, with lower inference costs compared to prior generative policy methods across MuJoCo benchmarks.

The one-step evaluation demonstrates the effectiveness of flow straightening: high return is maintained with a singular deterministic step, minimizing inference latency—a distinct practical advantage, particularly for real-time applications.

Ablation Analyses

A series of ablations isolate the effect of core architectural and algorithmic components:

Removing flow straightening ( $\mathcal{L}_{\text{fm}}$ ) drastically reduces one-step performance, affirming its necessity for fast-sampling regimes.
Disabling Q-guided action selection decreases cumulative return by approximately 10%, indicating the benefit of parallel candidate evaluation and value-guided selection in multimodal policy spaces.
Omitting the stochastic tail impairs exploration and convergence on complex tasks like Humanoid, consistent with its role in supporting entropy-driven exploration and diversity.

Figure 3: Ablations demonstrate that flow straightening is critical for one-step performance, Q-guided action selection improves robustness, and the stochastic tail accelerates exploration in complex environments.

Implications and Future Directions

TRFP demonstrates that generative, highly expressive flow-based policies can be reconciled with entropy-regularized RL objectives, stable training, and low-latency inference—addressing longstanding issues in the use of DPMs and CNFs for online RL. The hybrid prefix-tail architecture, surrogate likelihood, truncated gradients, and flow straightening regularization collectively deliver a method that is both theoretically sound and practically scalable.

Potential extensions include adapting TRFP to large-scale robot manipulation, real-world robotics, and long-horizon planning, as well as further theoretical refinement of surrogate likelihood error bounds and policy expressiveness. Additionally, TRFP’s decoupled training paradigm could inform future research in offline RL and the fine-tuning of value-conditioned latent-action (VLA) policies.

Conclusion

TRFP provides a principled and empirically validated approach to multimodal policy learning in MaxEnt RL, unifying expressivity, tractable entropy maximization, and efficient, low-latency inference. The experimental results substantiate its superiority over both unimodal and generative baselines across challenging control tasks. The hybrid deterministic-stochastic scheme and flow straightening regularization are of particular importance, serving as essential design choices for future generative policy architectures in RL.

Markdown Report Issue