One-Step Flow Policy Mirror Descent
- The paper introduces FPMD, an RL algorithm that leverages flow matching to enable one-step inference, significantly reducing inference latency while improving sample efficiency.
- It employs flow-matching parameterizations (FPMD-R and FPMD-M) that use single-step Euler integration to approximate complex, non-Gaussian policies in continuous control environments.
- Empirical results on Gym-MuJoCo benchmarks show that FPMD variants match or outperform state-of-the-art diffusion-policy baselines while requiring only a single network call per action and substantially lower inference latency.
One-Step Flow Policy Mirror Descent (FPMD) is an online reinforcement learning (RL) algorithm developed to enable highly efficient, single-step policy inference for expressive policy classes, notably in continuous control environments. FPMD bridges the expressive power of diffusion-based models and the real-time responsiveness characteristic of classical RL policies, achieving single-step sampling via flow-matching parameterizations. This approach builds on policy mirror descent (PMD) using conditional flow and MeanFlow models and provides theoretical and empirical advantages in terms of inference latency, sample efficiency, and convergence properties (Chen et al., 31 Jul 2025, Alfano et al., 2023).
1. Problem Setting and Motivating Context
FPMD operates in the standard discounted Markov Decision Process (MDP) framework:
- State space: $\mathcal{S}$
- Continuous action space: $\mathcal{A} \subseteq \mathbb{R}^d$
- Transition kernel: $P(\cdot \mid s, a)$
- Reward function: $r(s, a)$
- Discount factor: $\gamma \in [0, 1)$
- Initial state distribution: $\rho_0$
The objective is to maximize the expected discounted return
$$J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right],$$
where $\pi$ is the (possibly non-Gaussian) stochastic policy.
Contemporary diffusion-policy methods learn highly expressive representations by reversing a forward noising process via a diffusion model. At inference, these typically require many iterative denoising steps (a number of function evaluations, NFE, of $20$–$640$), resulting in sampling latencies that are prohibitive for real-time control. By contrast, classical Gaussian actors allow single-step inference but lack representational power for complex action spaces (Chen et al., 31 Jul 2025).
2. Algorithmic Framework: FPMD and Its Variants
2.1 Mirror Descent Objective
FPMD is based on a KL-regularized mirror descent update:
$$\pi_{k+1}(\cdot \mid s) = \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi_k}(s, a)\big] - \lambda\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\big),$$
with the closed-form Boltzmann policy solution:
$$\pi_{k+1}(a \mid s) = \frac{\pi_k(a \mid s)\, \exp\!\big(Q^{\pi_k}(s, a)/\lambda\big)}{Z_k(s)},$$
where $Z_k(s)$ is the normalization constant.
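The following is a minimal numerical sketch of the Boltzmann reweighting implied by this closed form: given candidate actions drawn from the current policy $\pi_k$ and their critic values, self-normalized weights $\exp(Q/\lambda)/Z$ approximate the density ratio $\pi_{k+1}/\pi_k$. The function name and batching convention are illustrative, not the paper's API.

```python
import torch

def boltzmann_weights(q_values: torch.Tensor, lam: float) -> torch.Tensor:
    """Self-normalized importance weights for the Boltzmann target policy.

    Given Q-values of candidate actions sampled from pi_k, the weights approximate
    pi_{k+1}(a|s) / pi_k(a|s) = exp(Q(s,a)/lambda) / Z(s), per state.
    `q_values` has shape (batch, num_candidates).
    """
    logits = q_values / lam
    # Subtract the per-state max before exponentiating for numerical stability.
    logits = logits - logits.max(dim=-1, keepdim=True).values
    return torch.softmax(logits, dim=-1)

# Usage: reweight candidate actions drawn from the current policy pi_k.
q = torch.randn(4, 8)            # hypothetical critic values, 4 states x 8 candidates
w = boltzmann_weights(q, lam=0.2)
print(w.sum(dim=-1))             # each row sums to 1
```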
2.2 Flow-Matching Parameterizations
Flow Policy (FPMD-R): The new policy is implicitly defined via a straight-line conditional flow:
$$a_t = (1 - t)\, a_0 + t\, a_1, \qquad t \in [0, 1],$$
with flow-matching loss:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\, a_0,\, a_1}\Big[\big\| v_\theta(a_t, t \mid s) - (a_1 - a_0) \big\|^2\Big],$$
where $a_0 \sim \mathcal{N}(0, I)$, $a_1 \sim \pi_{k+1}(\cdot \mid s)$, and $t \sim \mathcal{U}[0, 1]$.
Inference proceeds via single-step Euler integration:
$$a_1 \approx a_0 + v_\theta(a_0, 0 \mid s), \qquad a_0 \sim \mathcal{N}(0, I).$$
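Below is a minimal sketch of the straight-line flow-matching loss and the single-step Euler sampler described above, assuming a generic velocity network with signature `velocity_net(s, a_t, t)`; the authors' exact implementation may differ.

```python
import torch

def flow_matching_loss(velocity_net, s, a1):
    """Straight-line conditional flow-matching loss (FPMD-R style sketch).

    s:  (batch, state_dim) states; a1: (batch, act_dim) target actions
        (samples approximating the Boltzmann policy pi_{k+1}(.|s)).
    """
    a0 = torch.randn_like(a1)                       # a_0 ~ N(0, I)
    t = torch.rand(a1.shape[0], 1)                  # t ~ U[0, 1]
    a_t = (1.0 - t) * a0 + t * a1                   # straight-line interpolant
    target_v = a1 - a0                              # constant target velocity
    pred_v = velocity_net(s, a_t, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_one_step(velocity_net, s, act_dim):
    """Single-step Euler integration: a_1 ≈ a_0 + v_theta(a_0, 0 | s)."""
    a0 = torch.randn(s.shape[0], act_dim)
    t0 = torch.zeros(s.shape[0], 1)
    return a0 + velocity_net(s, a0, t0)
```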
MeanFlow Policy (FPMD-M): Here, a native one-step parameterization is adopted by defining an average-velocity field:
$$u(a_t, r, t \mid s) = \frac{1}{t - r} \int_r^t v(a_\tau, \tau \mid s)\, d\tau.$$
MeanFlow is trained to satisfy the identity
$$u(a_t, r, t \mid s) = v(a_t, t \mid s) - (t - r)\, \frac{d}{dt}\, u(a_t, r, t \mid s)$$
via a variational residual loss, and inference uses:
$$a_1 = a_0 + u_\theta(a_0, 0, 1 \mid s), \qquad a_0 \sim \mathcal{N}(0, I).$$
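A hedged sketch of a MeanFlow-style residual loss and one-step sampler follows, assuming an average-velocity network `u_net(s, a, r, t)` and straight-line interpolation paths; the total time derivative is obtained with a forward-mode JVP and the regression target is detached, as in standard MeanFlow training. This is an illustration, not the paper's exact loss.

```python
import torch
from torch.func import jvp

def meanflow_loss(u_net, s, a1):
    """MeanFlow residual loss sketch (FPMD-M style).

    Enforces u(a_t, r, t) ≈ v(a_t, t) - (t - r) * d/dt u(a_t, r, t), where the
    total derivative d/dt u = v · ∂_a u + ∂_t u is computed via a JVP and the
    regression target is treated as a constant (stop-gradient).
    """
    a0 = torch.randn_like(a1)
    t = torch.rand(a1.shape[0], 1)
    r = torch.rand(a1.shape[0], 1) * t              # sample r in [0, t]
    a_t = (1.0 - t) * a0 + t * a1
    v = a1 - a0                                     # instantaneous velocity on straight paths

    # JVP of u along the tangent (v, 0, 1) gives the total time derivative d/dt u.
    u, du_dt = jvp(lambda a, r_, t_: u_net(s, a, r_, t_),
                   (a_t, r, t),
                   (v, torch.zeros_like(r), torch.ones_like(t)))
    target = (v - (t - r) * du_dt).detach()         # stop-gradient on the target
    return ((u - target) ** 2).mean()

@torch.no_grad()
def sample_meanflow(u_net, s, act_dim):
    """One-step inference: a_1 = a_0 + u_theta(a_0, r=0, t=1 | s)."""
    a0 = torch.randn(s.shape[0], act_dim)
    zeros, ones = torch.zeros(s.shape[0], 1), torch.ones(s.shape[0], 1)
    return a0 + u_net(s, a0, zeros, ones)
```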
2.3 Pseudocode Structure
The overall FPMD algorithm alternates critic and actor updates using the described flow or MeanFlow actor. During training, FPMD-R uses multi-step flows (NFE $> 1$); at evaluation, both FPMD-R and FPMD-M require only a single forward network evaluation (NFE $= 1$) (Chen et al., 31 Jul 2025). A schematic training-loop sketch follows the table below.
| Variant | Training NFE | Eval NFE | Parametrization |
|---|---|---|---|
| FPMD-R | $>1$ (multi-step) | 1 | Conditional flow |
| FPMD-M | 1 | 1 | MeanFlow (one-step) |
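The sketch below outlines one plausible realization of this alternation, reusing the helper sketches above (`boltzmann_weights`, `flow_matching_loss`, `sample_one_step`); the replay interface, critic signature, and the candidate-resampling approximation of the Boltzmann target are hypothetical stand-ins for the paper's actual procedure.

```python
import torch

def fpmd_iteration(critic, velocity_net, actor_opt, replay, lam=0.2, num_candidates=8):
    """Hypothetical outline of one FPMD iteration (not the authors' code)."""
    s, a, r, s_next, done = replay.sample(256)      # hypothetical replay interface

    # 1) Critic update: standard TD-style regression toward a bootstrapped target.
    critic_update(critic, s, a, r, s_next, done)    # hypothetical helper

    # 2) Actor update: fit the flow to an approximation of the Boltzmann target
    #    pi_{k+1} ∝ pi_k * exp(Q/lambda), here via reweighted candidates from pi_k.
    cand = torch.stack([sample_one_step(velocity_net, s, a.shape[-1])
                        for _ in range(num_candidates)], dim=1)      # (B, N, act_dim)
    q = critic(s.unsqueeze(1).expand(-1, num_candidates, -1), cand)  # hypothetical (B, N) critic
    w = boltzmann_weights(q, lam)                                    # (B, N)
    idx = torch.multinomial(w, 1).squeeze(-1)                        # resample ≈ pi_{k+1}
    a1 = cand[torch.arange(cand.shape[0]), idx]

    actor_opt.zero_grad()
    loss = flow_matching_loss(velocity_net, s, a1)
    loss.backward()
    actor_opt.step()
```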
3. Theoretical Guarantees
3.1 Discretization Error and Single-Step Validity
FPMD leverages recent results on the discretization error of conditional flows (Hu et al., 2024). For one-step Euler integration, the Wasserstein distance between the sampled distribution and the target is controlled by a term that scales with the conditional variance of the target Boltzmann policy, showing that single-step inference is near-exact whenever this variance is sufficiently small. This bound provides justification for replacing iterative diffusion sampling with flow-based one-step sampling in low-variance regimes.
3.2 MeanFlow Convergence
Subject to a contraction assumption on the operator defined by the MeanFlow residual, repeated minimization converges to the correct average-velocity field, establishing exactness in the one-step setting for this class of velocity fields.
3.3 PMD in General Policy Classes
The one-step PMD framework, as formalized in (Alfano et al., 2023), admits a broader parameterization:
$$\pi_{k+1}(\cdot \mid s) = \arg\max_{\pi \in \Pi}\; \eta\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi_k}(s, a)\big] - D_h\big(\pi(\cdot \mid s),\, \pi_k(\cdot \mid s)\big),$$
where $h$ is a mirror map and $D_h$ the associated Bregman divergence. Under assumptions on approximation error, concentrability, and distribution mismatch, AMPO (Approximate Mirror Policy Optimization) achieves linear convergence of the suboptimality gap.
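For intuition, the sketch below instantiates this update in the tabular case with the negative-entropy mirror map, for which $D_h$ is the KL divergence and the update reduces to multiplicative weights; AMPO generalizes this to parametric policy classes.

```python
import numpy as np

def pmd_softmax_update(pi_k: np.ndarray, q_k: np.ndarray, eta: float) -> np.ndarray:
    """Tabular PMD step with the negative-entropy mirror map (KL Bregman divergence).

    pi_k, q_k: arrays of shape (num_states, num_actions). The closed-form update is
    pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q^{pi_k}(s, a)), i.e. multiplicative weights.
    """
    logits = np.log(pi_k + 1e-12) + eta * q_k
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    unnorm = np.exp(logits)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```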
4. Empirical Evaluation
FPMD was extensively evaluated on the 10-task Gym-MuJoCo v4 benchmark, including HalfCheetah, Walker2d, Ant, and Humanoid (Chen et al., 31 Jul 2025).
Key Findings
- Performance: FPMD-R matches or outperforms state-of-the-art diffusion-policy baselines (which require $20$–$640$ network calls per action) on 8/10 tasks while using a single network evaluation (NFE $= 1$) at inference.
- Inference Latency: FPMD-R and FPMD-M achieve comparable inference times (0.13–0.14 ms per action) to SAC, and are an order of magnitude faster than SDAC (diffusion) at $1.46$ ms.
- Ablation Studies: Training FPMD-R with increasing NFE improves performance up to a saturation point; FPMD-M is more efficient in training but may exhibit slightly lower peak performance.
| Algorithm | Inference NFE | Inference Latency (ms) | Empirical Score (tasks matched/exceeded) |
|---|---|---|---|
| SAC | 1 | 0.13 | Lower than FPMD-R on high-dim |
| SDAC | 20–640 | 1.46 | High, but slow |
| FPMD-R | 1 | 0.13 | 8/10 |
| FPMD-M | 1 | 0.14 | Comparable, slightly cheaper training |
Metrics included final cumulative reward, sample efficiency, inference speed (wall-clock ms), and NFE per action.
5. Implementation and Practical Considerations
Architecture and Training
- Critic: Two $256$-unit ReLU MLPs
- Flow actor $v_\theta$: MLP with inputs state $s$, flow time $t$, and intermediate action $a_t$
- MeanFlow actor $u_\theta$: same architecture, with the two scalar time endpoints $(r, t)$ added as inputs
Training hyperparameters include the Adam optimizer (separate learning rates for actor and critic), batch size 256, a 1M-transition replay buffer, and a prefill of 5K random samples.
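A minimal sketch of the two actor parameterizations follows, assuming 256-unit ReLU MLPs matching the critic and simple input concatenation; exact layer sizes and time embeddings may differ from the paper.

```python
import torch
import torch.nn as nn

class FlowActor(nn.Module):
    """Velocity network v_theta(s, a_t, t): concatenates state, noisy action, and time."""
    def __init__(self, state_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, s, a_t, t):
        return self.net(torch.cat([s, a_t, t], dim=-1))

class MeanFlowActor(nn.Module):
    """Average-velocity network u_theta(s, a_t, r, t): adds the two scalar endpoints."""
    def __init__(self, state_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, s, a_t, r, t):
        return self.net(torch.cat([s, a_t, r, t], dim=-1))
```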
Exploration
A “best-of-$N$” sampling strategy is used for exploratory action selection: $N$ candidate actions are sampled from the current policy, and the action with the maximal critic value $Q(s, a)$ is selected.
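A short sketch of this exploration rule, assuming the one-step flow sampler from Section 2.2 and a critic callable `critic(s, a)`; names and shapes are illustrative.

```python
import torch

@torch.no_grad()
def best_of_n_action(velocity_net, critic, s, act_dim, n=8):
    """Best-of-N exploration sketch: draw N one-step samples, keep the highest-Q one.

    Assumes a single state s of shape (state_dim,) and a critic(s, a) -> Q-value per pair.
    """
    s_rep = s.unsqueeze(0).expand(n, -1)            # repeat the state N times
    a0 = torch.randn(n, act_dim)
    t0 = torch.zeros(n, 1)
    candidates = a0 + velocity_net(s_rep, a0, t0)   # single-step Euler samples
    q = critic(s_rep, candidates).squeeze(-1)       # (N,)
    return candidates[q.argmax()]
```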
Limitations
- Single-step Euler bias can be large in early training when policy variance is high; FPMD-R mitigates this via multi-step training and single-step evaluation.
- The contraction assumption for MeanFlow convergence may be violated in high-variance regimes, potentially reducing performance relative to FPMD-R.
- The regularization coefficient $\lambda$ in the mirror descent objective tunes the exploration-exploitation tradeoff.
6. Extensions and Connections
Extensions
- Integration with latent variable models (e.g., VAEs) enables application to pixel or image-based observations.
- Discrete-action adaptation is plausible via discrete flow parameterizations or rectified discrete-flow matching.
- Adaptive step-size or higher-order integration schemes (e.g., Heun's method) may further reduce discretization bias.
- Incorporation of trust-region-style KL constraints is possible.
Connections to Mirror Descent Literature
FPMD extends classical PMD frameworks (Alfano et al., 2023) to highly expressive, non-Gaussian policies via flows, retaining theoretical guarantees and unifying tabular, log-linear, and deep policy parameterizations. The AMPO variant achieves linear convergence for general parametric classes, admitting near-optimal sample complexity for shallow neural network policies.
| Methodology | Key Benefit | Theoretical Guarantee |
|---|---|---|
| FPMD (flow/MeanFlow) | Fast, expressive policies | Single-step Wasserstein bound |
| Classical PMD/AMPO | General parameterization | Linear convergence under assumptions |
References
- "One-Step Flow Policy Mirror Descent" (Chen et al., 31 Jul 2025)
- "A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence" (Alfano et al., 2023)