
One-Step Flow Policy Mirror Descent

Updated 10 December 2025
  • The paper introduces FPMD, an RL algorithm that leverages flow matching to enable one-step inference, significantly reducing inference latency while improving sample efficiency.
  • It employs two flow-matching parameterizations, FPMD-R (conditional flow with single-step Euler sampling) and FPMD-M (MeanFlow), to represent complex, non-Gaussian policies in continuous control environments.
  • Empirical results on Gym-MuJoCo benchmarks show that FPMD variants match or outperform state-of-the-art diffusion-policy baselines while requiring only a single network call per action, with inference speeds competitive with Gaussian-policy methods.

One-Step Flow Policy Mirror Descent (FPMD) is an online reinforcement learning (RL) algorithm developed to enable highly efficient, single-step policy inference for expressive policy classes, notably in continuous control environments. FPMD bridges the expressive power of diffusion-based models and the real-time responsiveness characteristic of classical RL policies, achieving single-step sampling via flow-matching parameterizations. This approach builds on policy mirror descent (PMD) using conditional flow and MeanFlow models and provides theoretical and empirical advantages in terms of inference latency, sample efficiency, and convergence properties (Chen et al., 31 Jul 2025, Alfano et al., 2023).

1. Problem Setting and Motivating Context

FPMD operates in the standard discounted Markov Decision Process (MDP) framework:

  • State space: $\mathcal{S}$
  • Continuous action space: $\mathcal{A} \subseteq \mathbb{R}^d$
  • Transition kernel: $P(s'|s,a)$
  • Reward function: $r:\mathcal{S} \times \mathcal{A} \to \mathbb{R}$
  • Discount factor: $\gamma \in (0,1)$
  • Initial state distribution: $\mu_0$

The objective is to maximize the expected discounted return
$$\rho(\pi) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right], \quad a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t,a_t),$$
where $\pi(\cdot|s)\in\Delta(\mathcal{A})$ is the (possibly non-Gaussian) stochastic policy.
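For concreteness, this return is simply a discounted sum along a sampled trajectory; a minimal Monte-Carlo estimator in plain Python (not from the paper) is:

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of sum_t gamma^t * r_t for one sampled trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g
```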

Contemporary diffusion-policy methods learn highly expressive action distributions by reversing a forward noising process with a learned denoiser. At inference, they typically require many iterative denoising steps (a number of function evaluations, NFE, of roughly $K \sim 20$–$640$), yielding sampling latencies that are prohibitive for real-time control. By contrast, classical Gaussian actors allow single-step inference but lack the representational power for complex, multimodal action distributions (Chen et al., 31 Jul 2025).

2. Algorithmic Framework: FPMD and Its Variants

2.1 Mirror Descent Objective

FPMD is based on a KL-regularized mirror descent update
$$\pi_{\text{new}} = \arg\max_\pi\; \mathbb{E}_{a\sim\pi(\cdot|s)}\!\left[Q^{\pi_{\text{old}}}(s,a)\right] - \lambda\, D_{\mathrm{KL}}\!\left[\pi(\cdot|s)\,\|\,\pi_{\text{old}}(\cdot|s)\right],$$
whose closed-form solution is the Boltzmann policy
$$\pi_{\text{new}}(a|s) = \frac{\pi_{\text{old}}(a|s)\, \exp\!\left(\tfrac{1}{\lambda} Q^{\pi_{\text{old}}}(s,a)\right)}{Z(s)},$$
where $Z(s)$ is the normalization constant.
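The closed form follows from a standard first-order argument (not specific to FPMD): fixing a state $s$ and maximizing pointwise over $\pi(\cdot|s)$ with a Lagrange multiplier $\nu(s)$ enforcing normalization, the stationarity condition is
$$Q^{\pi_{\text{old}}}(s,a) - \lambda\left(\log\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} + 1\right) - \nu(s) = 0,$$
so $\pi(a|s) \propto \pi_{\text{old}}(a|s)\exp\!\left(Q^{\pi_{\text{old}}}(s,a)/\lambda\right)$, and normalizing over $\mathcal{A}$ yields $Z(s)$.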

2.2 Flow-Matching Parameterizations

Flow Policy (FPMD-R): The new policy $\pi_{\text{new}}$ is implicitly defined via a straight-line conditional flow
$$\frac{da_t}{dt} = v_\theta(t, a_t \mid s), \quad a_0\sim\mathcal{N}(\mu,\sigma^2),\ a_1\sim\pi_{\text{new}}(a|s),$$
with flow-matching loss
$$L_{\mathrm{FPMD}}(\theta) = \mathbb{E}\Big[w(s,a_1)\,\big\| (a_1-a_0) - v_\theta(t,a_t \mid s) \big\|^2\Big],$$
where $w(s,a_1) = \exp\!\left( Q^{\pi_{\text{old}}}(s,a_1)/\lambda \right)$, $t \sim U[0,1]$, and $a_t = (1-t)a_0 + t a_1$.

Inference proceeds via single-step Euler integration: $\hat{a}_1 = a_0 + v_\theta(0, a_0 \mid s)$.
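A minimal PyTorch-style sketch of this objective and of one-step sampling, written against the equations above. The network interface `v_theta(t, a_t, s)`, the critic `q_fn(s, a)`, and details such as using a standard normal reference and normalizing the Boltzmann weights per batch are assumptions or implementation choices, not specifics taken from the paper:

```python
import torch

def fpmd_r_actor_loss(v_theta, q_fn, states, actions, lam=0.2):
    """Q-weighted conditional flow-matching loss (sketch of L_FPMD).

    `actions` play the role of a_1 samples from pi_old stored in the replay buffer.
    """
    a1 = actions
    a0 = torch.randn_like(a1)                          # reference noise (N(0, I) here)
    t = torch.rand(a1.shape[0], 1, device=a1.device)   # t ~ U[0, 1]
    a_t = (1.0 - t) * a0 + t * a1                      # straight-line interpolant

    with torch.no_grad():
        # Boltzmann weight exp(Q(s, a_1)/lambda); normalized over the batch for
        # numerical stability (an implementation choice, not from the paper).
        w = torch.softmax(q_fn(states, a1) / lam, dim=0) * a1.shape[0]

    target = a1 - a0                                   # conditional velocity target
    pred = v_theta(t, a_t, states)
    return (w * (pred - target).pow(2).sum(dim=-1, keepdim=True)).mean()

@torch.no_grad()
def sample_action_one_step(v_theta, state, action_dim):
    """Single-step Euler sampling: a_1_hat = a_0 + v_theta(0, a_0 | s)."""
    a0 = torch.randn(state.shape[0], action_dim, device=state.device)
    t0 = torch.zeros(state.shape[0], 1, device=state.device)
    return a0 + v_theta(t0, a0, state)
```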

MeanFlow Policy (FPMD-M): A native one-step parameterization is adopted by defining an average-velocity field
$$u(t,r,a_t \mid s) = \frac{1}{t-r}\int_r^t v(\tau,a_\tau \mid s)\, d\tau.$$
The MeanFlow network is trained to satisfy $a_1 = a_0 + u_\theta(1,0,a_0 \mid s)$ via a variational residual loss, and inference uses $\hat{a}_1 = a_0 + u_\theta(1,0,a_0 \mid s)$.
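The one-step identity is exact by construction: integrating the flow ODE from $t=0$ to $t=1$ gives
$$a_1 = a_0 + \int_0^1 v(\tau, a_\tau \mid s)\, d\tau = a_0 + (1-0)\, u(1, 0, a_0 \mid s),$$
so a network that matches the average velocity at $(t,r)=(1,0)$ produces samples in a single evaluation with no Euler discretization error.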

2.3 Pseudocode Structure

The overall FPMD algorithm alternates critic and actor updates using the flow or MeanFlow actor described above. During training, FPMD-R samples actions with multi-step flows ($K=20$); at evaluation, both FPMD-R and FPMD-M require only a single forward network evaluation ($K=1$) (Chen et al., 31 Jul 2025). A minimal sketch of this loop appears after the table below.

| Variant | Training NFE | Eval NFE | Parametrization |
|---------|--------------|----------|-----------------|
| FPMD-R  | $K=20$       | 1        | Conditional flow |
| FPMD-M  | 1            | 1        | MeanFlow (one-step) |
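A minimal sketch of this alternation, reusing the hypothetical helpers from Section 2.2 (`fpmd_r_actor_loss`, `sample_action_one_step`) and assuming standard off-policy machinery (replay batch, target critic, optimizers); target-network updates, best-of-N exploration, and the MeanFlow variant are omitted:

```python
import torch

@torch.no_grad()
def sample_action_multi_step(v_theta, state, action_dim, K=20):
    """K-step Euler integration of the flow ODE (K=20 during FPMD-R training)."""
    a = torch.randn(state.shape[0], action_dim, device=state.device)
    dt = 1.0 / K
    for k in range(K):
        t = torch.full((state.shape[0], 1), k * dt, device=state.device)
        a = a + dt * v_theta(t, a, state)
    return a

def fpmd_training_step(batch, critic, critic_target, v_theta,
                       critic_opt, actor_opt, action_dim, gamma=0.99, lam=0.2):
    """One FPMD iteration: a TD critic update followed by the flow actor update."""
    s, a, r, s_next, done = batch   # r and done assumed to have shape (B, 1)

    # Critic: ordinary TD(0) target with a target network (not FPMD-specific).
    with torch.no_grad():
        a_next = sample_action_multi_step(v_theta, s_next, action_dim)
        td_target = r + gamma * (1.0 - done) * critic_target(s_next, a_next)
    critic_loss = (critic(s, a) - td_target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: Q-weighted flow matching toward the Boltzmann policy (Section 2.2).
    actor_loss = fpmd_r_actor_loss(v_theta, critic, s, a, lam=lam)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```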

3. Theoretical Guarantees

3.1 Discretization Error and Single-Step Validity

FPMD leverages recent results on the discretization error of conditional flows [Hu et al. 2024]. For one-step Euler integration,
$$W_2(\pi_{\text{new}}, \hat{\pi}_1) \leq \sqrt{\mathrm{Var}(a_1 \mid s)},$$
showing that single-step inference is near-exact whenever the variance of the target Boltzmann policy is sufficiently small. This bound justifies replacing iterative diffusion sampling with flow-based one-step sampling in low-variance regimes.

3.2 MeanFlow Convergence

Subject to a contraction assumption on the operator defined by the MeanFlow residual, repeated minimization converges to the correct average-velocity field, establishing exactness in the one-step setting for this class of velocity fields.

3.3 PMD in General Policy Classes

The one-step PMD framework, as formalized in (Alfano et al., 2023), admits broader parameterization:
$$\pi^{t+1}_s = \arg\min_{p\in\Delta(\mathcal{A})} D_\psi\!\left(p, \nabla\psi^*(y^{t+1}_s)\right), \quad y^{t+1}_s = \nabla\psi(\pi^t_s) + \eta_t Q^t_s,$$
where $\psi$ is a mirror map and $D_\psi$ the associated Bregman divergence. Under assumptions on approximation error, concentrability, and distribution mismatch, AMPO (Approximate Mirror Policy Optimization) achieves linear convergence of the suboptimality gap.
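As a concrete instance (standard in the PMD literature, not specific to AMPO's analysis): with the negative-entropy mirror map $\psi(\pi) = \sum_a \pi(a)\log\pi(a)$, one has $\nabla\psi(\pi) = \log\pi + 1$ and $\nabla\psi^*$ given by the softmax, so the update reduces to
$$\pi^{t+1}_s(a) \propto \pi^t_s(a)\, \exp\!\left(\eta_t Q^t_s(a)\right),$$
which is exactly the Boltzmann step of Section 2.1 with $\eta_t = 1/\lambda$.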

4. Empirical Evaluation

FPMD was extensively evaluated on the 10-task Gym-MuJoCo v4 benchmark, including HalfCheetah, Walker2d, Ant, and Humanoid (Chen et al., 31 Jul 2025).

Key Findings

  • Performance: FPMD-R matches or outperforms state-of-the-art diffusion-policy baselines (which require $20$–$640$ network calls per action) in 8/10 tasks while using NFE $=1$ at inference.
  • Inference Latency: FPMD-R and FPMD-M achieve comparable inference times (0.13–0.14 ms per action) to SAC, and are an order of magnitude faster than SDAC (diffusion) at $1.46$ ms.
  • Ablation Studies: Training FPMD-R with varying NFE ($1 \rightarrow 20$) shows performance increasing up to saturation; FPMD-M is more efficient in training but may exhibit slightly lower peak performance.
| Algorithm | Inference NFE | Inference Latency (ms) | Empirical Score (tasks matched/exceeded) |
|-----------|---------------|------------------------|------------------------------------------|
| SAC       | 1             | 0.13                   | Lower than FPMD-R on high-dimensional tasks |
| SDAC      | 20–640        | 1.46                   | High, but slow |
| FPMD-R    | 1             | 0.13                   | 8/10 |
| FPMD-M    | 1             | 0.14                   | Comparable; slightly cheaper training |

Metrics included final cumulative reward, sample efficiency, inference speed (wall-clock ms), and NFE per action.

5. Implementation and Practical Considerations

Architecture and Training

  • Critic: two $256$-unit ReLU MLPs
  • Flow actor $v_\theta$: $3 \times 256$ MLP; inputs are the state $s$, time $t$, and action $a_t$
  • MeanFlow actor $u_\theta$: same architecture, with the scalar time endpoints $(r, t)$ added as inputs
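A minimal PyTorch sketch of these networks, matching the stated widths. The split "two $256$-unit ReLU MLPs" is interpreted here as a double-Q critic with 256-unit hidden layers, and the input concatenation scheme is an assumption rather than a detail from the paper:

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_hidden):
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers += [nn.Linear(d, out_dim)]
    return nn.Sequential(*layers)

class Critic(nn.Module):
    """Double-Q critic: two ReLU MLPs with 256-unit hidden layers, (s, a) -> Q."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.q1 = mlp(state_dim + action_dim, hidden, 1, n_hidden=2)
        self.q2 = mlp(state_dim + action_dim, hidden, 1, n_hidden=2)

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return torch.min(self.q1(x), self.q2(x))   # pessimistic Q (a common choice)

class FlowActor(nn.Module):
    """3 x 256 MLP velocity field v_theta(t, a_t, s) -> velocity in action space."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = mlp(state_dim + action_dim + 1, hidden, action_dim, n_hidden=3)

    def forward(self, t, a_t, s):
        return self.net(torch.cat([s, a_t, t], dim=-1))
```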

Training hyperparameters include the Adam optimizer with learning rate $3\mathrm{e}{-}4$ for both actor and critic, batch size 256, a 1M-transition replay buffer, and a prefill of 5K random samples.

Exploration

A “best-of-N” sampling strategy is used for exploratory action selection: $N=20$ candidate actions are sampled from the policy, and the action with maximal $Q(s,\cdot)$ is selected.
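A sketch of this selection rule, reusing the hypothetical `sample_action_one_step` and critic interface from the earlier snippets; the single state is assumed to have shape `(1, state_dim)`:

```python
import torch

@torch.no_grad()
def best_of_n_action(v_theta, critic, state, action_dim, n=20):
    """Sample n candidate actions from the one-step policy; keep the argmax-Q one."""
    s_rep = state.expand(n, -1)                                        # (n, state_dim)
    candidates = sample_action_one_step(v_theta, s_rep, action_dim)    # (n, action_dim)
    q_vals = critic(s_rep, candidates).squeeze(-1)                     # (n,)
    return candidates[q_vals.argmax()]
```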

Limitations

  • Single-step Euler bias can be large in early training when policy variance is high; FPMD-R mitigates this via multi-step training and single-step evaluation.
  • The contraction assumption for MeanFlow convergence may be violated in high-variance regimes, potentially reducing performance relative to FPMD-R.
  • The temperature $\lambda$ in the mirror descent objective tunes the exploration-exploitation tradeoff, with typical values in $[0.1, 1.0]$.

6. Extensions and Connections

Extensions

  • Integration with latent variable models (e.g., VAEs) enables application to pixel or image-based observations.
  • Discrete-action adaptation is plausible via discrete flow parameterizations or rectified discrete-flow matching.
  • Adaptive step sizes or higher-order integration schemes (e.g., Heun's method) may further reduce discretization bias.
  • Incorporation of trust-region-style KL constraints is possible.

Connections to Mirror Descent Literature

FPMD extends classical PMD frameworks (Alfano et al., 2023) to highly expressive, non-Gaussian policies via flows, retaining theoretical guarantees and unifying tabular, log-linear, and deep policy parameterizations. The AMPO variant achieves linear convergence for general parametric classes, admitting near-optimal sample complexity in shallow neural network policies.

| Methodology | Key Benefit | Theoretical Guarantee |
|-------------|-------------|-----------------------|
| FPMD (flow/MeanFlow) | Fast, expressive policies | Single-step Wasserstein bound |
| Classical PMD/AMPO | General parameterization | Linear convergence under assumptions |

