
One-Step Flow Policy Mirror Descent

Updated 10 December 2025
  • The paper introduces FPMD, an RL algorithm that leverages flow matching to enable one-step inference, significantly reducing inference latency while improving sample efficiency.
  • It employs two flow-matching parameterizations, FPMD-R (conditional flow with single-step Euler sampling) and FPMD-M (MeanFlow), to represent complex, non-Gaussian policies in continuous control environments.
  • Empirical results on Gym-MuJoCo benchmarks show that FPMD variants match or outperform state-of-the-art diffusion-policy baselines while requiring only a single network call per action, with inference speeds competitive with Gaussian-policy methods.

One-Step Flow Policy Mirror Descent (FPMD) is an online reinforcement learning (RL) algorithm developed to enable highly efficient, single-step policy inference for expressive policy classes, notably in continuous control environments. FPMD bridges the expressive power of diffusion-based models and the real-time responsiveness characteristic of classical RL policies, achieving single-step sampling via flow-matching parameterizations. This approach builds on policy mirror descent (PMD) using conditional flow and MeanFlow models and provides theoretical and empirical advantages in terms of inference latency, sample efficiency, and convergence properties (Chen et al., 31 Jul 2025, Alfano et al., 2023).

1. Problem Setting and Motivating Context

FPMD operates in the standard discounted Markov Decision Process (MDP) framework:

  • State space: $\mathcal{S}$
  • Continuous action space: $\mathcal{A} \subseteq \mathbb{R}^d$
  • Transition kernel: $P(s'|s,a)$
  • Reward function: $r:\mathcal{S} \times \mathcal{A} \to \mathbb{R}$
  • Discount factor: $\gamma \in (0,1)$
  • Initial state distribution: $\mu_0$

The objective is to maximize the expected discounted return
$$\rho(\pi) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right], \quad a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P(\cdot|s_t,a_t),$$
where $\pi(\cdot|s)\in\Delta(\mathcal{A})$ is the (possibly non-Gaussian) stochastic policy.
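For concreteness, this return is simply a discounted sum along a sampled trajectory; a minimal Monte-Carlo estimator in plain Python (not from the paper) is:

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo estimate of sum_t gamma^t * r_t for one sampled trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g
```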

Contemporary diffusion-policy methods learn highly expressive action distributions by reversing a forward noising process with a learned denoiser. At inference, they typically require many iterative denoising steps (a number of function evaluations, NFE, of roughly $K \sim 20$–$640$), yielding sampling latencies that are prohibitive for real-time control. By contrast, classical Gaussian actors allow single-step inference but lack the representational power for complex, multimodal action distributions (Chen et al., 31 Jul 2025).

2. Algorithmic Framework: FPMD and Its Variants

2.1 Mirror Descent Objective

FPMD is based on a KL-regularized mirror descent update
$$\pi_{\text{new}} = \arg\max_\pi\; \mathbb{E}_{a\sim\pi(\cdot|s)}\!\left[Q^{\pi_{\text{old}}}(s,a)\right] - \lambda\, D_{\mathrm{KL}}\!\left[\pi(\cdot|s)\,\|\,\pi_{\text{old}}(\cdot|s)\right],$$
whose closed-form solution is the Boltzmann policy
$$\pi_{\text{new}}(a|s) = \frac{\pi_{\text{old}}(a|s)\, \exp\!\left(\tfrac{1}{\lambda} Q^{\pi_{\text{old}}}(s,a)\right)}{Z(s)},$$
where $Z(s)$ is the normalization constant.
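The closed form follows from a standard first-order argument (not specific to FPMD): fixing a state $s$ and maximizing pointwise over $\pi(\cdot|s)$ with a Lagrange multiplier $\nu(s)$ enforcing normalization, the stationarity condition is
$$Q^{\pi_{\text{old}}}(s,a) - \lambda\left(\log\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} + 1\right) - \nu(s) = 0,$$
so $\pi(a|s) \propto \pi_{\text{old}}(a|s)\exp\!\left(Q^{\pi_{\text{old}}}(s,a)/\lambda\right)$, and normalizing over $\mathcal{A}$ yields $Z(s)$.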

2.2 Flow-Matching Parameterizations

Flow Policy (FPMD-R): The new policy $\pi_{\text{new}}$ is implicitly defined via a straight-line conditional flow
$$\frac{da_t}{dt} = v_\theta(t, a_t \mid s), \quad a_0\sim\mathcal{N}(\mu,\sigma^2),\ a_1\sim\pi_{\text{new}}(a|s),$$
with flow-matching loss
$$L_{\mathrm{FPMD}}(\theta) = \mathbb{E}\Big[w(s,a_1)\,\big\| (a_1-a_0) - v_\theta(t,a_t \mid s) \big\|^2\Big],$$
where $w(s,a_1) = \exp\!\left( Q^{\pi_{\text{old}}}(s,a_1)/\lambda \right)$, $t \sim U[0,1]$, and $a_t = (1-t)a_0 + t a_1$.

Inference proceeds via single-step Euler integration: $\hat{a}_1 = a_0 + v_\theta(0, a_0 \mid s)$.
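A minimal PyTorch-style sketch of this objective and of one-step sampling, written against the equations above. The network interface `v_theta(t, a_t, s)`, the critic `q_fn(s, a)`, and details such as using a standard normal reference and normalizing the Boltzmann weights per batch are assumptions or implementation choices, not specifics taken from the paper:

```python
import torch

def fpmd_r_actor_loss(v_theta, q_fn, states, actions, lam=0.2):
    """Q-weighted conditional flow-matching loss (sketch of L_FPMD).

    `actions` play the role of a_1 samples from pi_old stored in the replay buffer.
    """
    a1 = actions
    a0 = torch.randn_like(a1)                          # reference noise (N(0, I) here)
    t = torch.rand(a1.shape[0], 1, device=a1.device)   # t ~ U[0, 1]
    a_t = (1.0 - t) * a0 + t * a1                      # straight-line interpolant

    with torch.no_grad():
        # Boltzmann weight exp(Q(s, a_1)/lambda); normalized over the batch for
        # numerical stability (an implementation choice, not from the paper).
        w = torch.softmax(q_fn(states, a1) / lam, dim=0) * a1.shape[0]

    target = a1 - a0                                   # conditional velocity target
    pred = v_theta(t, a_t, states)
    return (w * (pred - target).pow(2).sum(dim=-1, keepdim=True)).mean()

@torch.no_grad()
def sample_action_one_step(v_theta, state, action_dim):
    """Single-step Euler sampling: a_1_hat = a_0 + v_theta(0, a_0 | s)."""
    a0 = torch.randn(state.shape[0], action_dim, device=state.device)
    t0 = torch.zeros(state.shape[0], 1, device=state.device)
    return a0 + v_theta(t0, a0, state)
```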

MeanFlow Policy (FPMD-M): A native one-step parameterization is adopted by defining an average-velocity field
$$u(t,r,a_t \mid s) = \frac{1}{t-r}\int_r^t v(\tau,a_\tau \mid s)\, d\tau.$$
The MeanFlow network is trained to satisfy $a_1 = a_0 + u_\theta(1,0,a_0 \mid s)$ via a variational residual loss, and inference uses $\hat{a}_1 = a_0 + u_\theta(1,0,a_0 \mid s)$.
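The one-step identity is exact by construction: integrating the flow ODE from $t=0$ to $t=1$ gives
$$a_1 = a_0 + \int_0^1 v(\tau, a_\tau \mid s)\, d\tau = a_0 + (1-0)\, u(1, 0, a_0 \mid s),$$
so a network that matches the average velocity at $(t,r)=(1,0)$ produces samples in a single evaluation with no Euler discretization error.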

2.3 Pseudocode Structure

The overall FPMD algorithm alternates critic and actor updates using the flow or MeanFlow actor described above. During training, FPMD-R samples actions with multi-step flows ($K=20$); at evaluation, both FPMD-R and FPMD-M require only a single forward network evaluation ($K=1$) (Chen et al., 31 Jul 2025). A minimal sketch of this loop appears after the table below.

| Variant | Training NFE | Eval NFE | Parametrization |
|---------|--------------|----------|-----------------|
| FPMD-R  | $K=20$       | 1        | Conditional flow |
| FPMD-M  | 1            | 1        | MeanFlow (one-step) |
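A minimal sketch of this alternation, reusing the hypothetical helpers from Section 2.2 (`fpmd_r_actor_loss`, `sample_action_one_step`) and assuming standard off-policy machinery (replay batch, target critic, optimizers); target-network updates, best-of-N exploration, and the MeanFlow variant are omitted:

```python
import torch

@torch.no_grad()
def sample_action_multi_step(v_theta, state, action_dim, K=20):
    """K-step Euler integration of the flow ODE (K=20 during FPMD-R training)."""
    a = torch.randn(state.shape[0], action_dim, device=state.device)
    dt = 1.0 / K
    for k in range(K):
        t = torch.full((state.shape[0], 1), k * dt, device=state.device)
        a = a + dt * v_theta(t, a, state)
    return a

def fpmd_training_step(batch, critic, critic_target, v_theta,
                       critic_opt, actor_opt, action_dim, gamma=0.99, lam=0.2):
    """One FPMD iteration: a TD critic update followed by the flow actor update."""
    s, a, r, s_next, done = batch   # r and done assumed to have shape (B, 1)

    # Critic: ordinary TD(0) target with a target network (not FPMD-specific).
    with torch.no_grad():
        a_next = sample_action_multi_step(v_theta, s_next, action_dim)
        td_target = r + gamma * (1.0 - done) * critic_target(s_next, a_next)
    critic_loss = (critic(s, a) - td_target).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: Q-weighted flow matching toward the Boltzmann policy (Section 2.2).
    actor_loss = fpmd_r_actor_loss(v_theta, critic, s, a, lam=lam)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```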

3. Theoretical Guarantees

3.1 Discretization Error and Single-Step Validity

FPMD leverages recent results on the discretization error of conditional flows [Hu et al. 2024]. For one-step Euler integration,
$$W_2(\pi_{\text{new}}, \hat{\pi}_1) \leq \sqrt{\mathrm{Var}(a_1 \mid s)},$$
showing that single-step inference is near-exact whenever the variance of the target Boltzmann policy is sufficiently small. This bound justifies replacing iterative diffusion sampling with flow-based one-step sampling in low-variance regimes.

3.2 MeanFlow Convergence

Subject to a contraction assumption on the operator defined by the MeanFlow residual, repeated minimization converges to the correct average-velocity field, establishing exactness in the one-step setting for this class of velocity fields.

3.3 PMD in General Policy Classes

The one-step PMD framework, as formalized in (Alfano et al., 2023), admits broader parameterization:
$$\pi^{t+1}_s = \arg\min_{p\in\Delta(\mathcal{A})} D_\psi\!\left(p, \nabla\psi^*(y^{t+1}_s)\right), \quad y^{t+1}_s = \nabla\psi(\pi^t_s) + \eta_t Q^t_s,$$
where $\psi$ is a mirror map and $D_\psi$ the associated Bregman divergence. Under assumptions on approximation error, concentrability, and distribution mismatch, AMPO (Approximate Mirror Policy Optimization) achieves linear convergence of the suboptimality gap.
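As a concrete instance (standard in the PMD literature, not specific to AMPO's analysis): with the negative-entropy mirror map $\psi(\pi) = \sum_a \pi(a)\log\pi(a)$, one has $\nabla\psi(\pi) = \log\pi + 1$ and $\nabla\psi^*$ given by the softmax, so the update reduces to
$$\pi^{t+1}_s(a) \propto \pi^t_s(a)\, \exp\!\left(\eta_t Q^t_s(a)\right),$$
which is exactly the Boltzmann step of Section 2.1 with $\eta_t = 1/\lambda$.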

4. Empirical Evaluation

FPMD was extensively evaluated on the 10-task Gym-MuJoCo v4 benchmark, including HalfCheetah, Walker2d, Ant, and Humanoid (Chen et al., 31 Jul 2025).

Key Findings

  • Performance: FPMD-R matches or outperforms state-of-the-art diffusion-policy baselines (which require $20$–$640$ network calls per action) in 8/10 tasks while using NFE $=1$ at inference.
  • Inference Latency: FPMD-R and FPMD-M achieve comparable inference times (0.13–0.14 ms per action) to SAC, and are an order of magnitude faster than SDAC (diffusion) at $1.46$ ms.
  • Ablation Studies: Training FPMD-R with varying NFE ($1 \rightarrow 20$) shows performance increasing up to saturation; FPMD-M is more efficient in training but may exhibit slightly lower peak performance.
| Algorithm | Inference NFE | Inference Latency (ms) | Empirical Score (tasks matched/exceeded) |
|-----------|---------------|------------------------|------------------------------------------|
| SAC       | 1             | 0.13                   | Lower than FPMD-R on high-dimensional tasks |
| SDAC      | 20–640        | 1.46                   | High, but slow |
| FPMD-R    | 1             | 0.13                   | 8/10 |
| FPMD-M    | 1             | 0.14                   | Comparable; slightly cheaper training |

Metrics included final cumulative reward, sample efficiency, inference speed (wall-clock ms), and NFE per action.

5. Implementation and Practical Considerations

Architecture and Training

  • Critic: two $256$-unit ReLU MLPs
  • Flow actor $v_\theta$: $3 \times 256$ MLP; inputs are the state $s$, time $t$, and action $a_t$
  • MeanFlow actor $u_\theta$: same architecture, with the scalar time endpoints $(r, t)$ added as inputs
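A minimal PyTorch sketch of these networks, matching the stated widths. The split "two $256$-unit ReLU MLPs" is interpreted here as a double-Q critic with 256-unit hidden layers, and the input concatenation scheme is an assumption rather than a detail from the paper:

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_hidden):
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers += [nn.Linear(d, out_dim)]
    return nn.Sequential(*layers)

class Critic(nn.Module):
    """Double-Q critic: two ReLU MLPs with 256-unit hidden layers, (s, a) -> Q."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.q1 = mlp(state_dim + action_dim, hidden, 1, n_hidden=2)
        self.q2 = mlp(state_dim + action_dim, hidden, 1, n_hidden=2)

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return torch.min(self.q1(x), self.q2(x))   # pessimistic Q (a common choice)

class FlowActor(nn.Module):
    """3 x 256 MLP velocity field v_theta(t, a_t, s) -> velocity in action space."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = mlp(state_dim + action_dim + 1, hidden, action_dim, n_hidden=3)

    def forward(self, t, a_t, s):
        return self.net(torch.cat([s, a_t, t], dim=-1))
```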

Training hyperparameters include the Adam optimizer with learning rate $3\mathrm{e}{-}4$ for both actor and critic, batch size 256, a 1M-transition replay buffer, and a prefill of 5K random samples.

Exploration

A “best-of-N” sampling strategy is used for exploratory action selection: $N=20$ candidate actions are sampled from the policy, and the action with maximal $Q(s,\cdot)$ is selected.
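A sketch of this selection rule, reusing the hypothetical `sample_action_one_step` and critic interface from the earlier snippets; the single state is assumed to have shape `(1, state_dim)`:

```python
import torch

@torch.no_grad()
def best_of_n_action(v_theta, critic, state, action_dim, n=20):
    """Sample n candidate actions from the one-step policy; keep the argmax-Q one."""
    s_rep = state.expand(n, -1)                                        # (n, state_dim)
    candidates = sample_action_one_step(v_theta, s_rep, action_dim)    # (n, action_dim)
    q_vals = critic(s_rep, candidates).squeeze(-1)                     # (n,)
    return candidates[q_vals.argmax()]
```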

Limitations

  • Single-step Euler bias can be large in early training when policy variance is high; FPMD-R mitigates this via multi-step training and single-step evaluation.
  • The contraction assumption for MeanFlow convergence may be violated in high-variance regimes, potentially reducing performance relative to FPMD-R.
  • The temperature $\lambda$ in the mirror descent objective tunes the exploration-exploitation tradeoff, with typical values in $[0.1, 1.0]$.

6. Extensions and Connections

Extensions

  • Integration with latent variable models (e.g., VAEs) enables application to pixel or image-based observations.
  • Discrete-action adaptation is plausible via discrete flow parameterizations or rectified discrete-flow matching.
  • Adaptive step sizes or higher-order integration schemes (e.g., Heun's method) may further reduce discretization bias.
  • Incorporation of trust-region-style KL constraints is possible.

Connections to Mirror Descent Literature

FPMD extends classical PMD frameworks (Alfano et al., 2023) to highly expressive, non-Gaussian policies via flows, retaining theoretical guarantees and unifying tabular, log-linear, and deep policy parameterizations. The AMPO variant achieves linear convergence for general parametric classes, admitting near-optimal sample complexity in shallow neural network policies.

| Methodology | Key Benefit | Theoretical Guarantee |
|-------------|-------------|-----------------------|
| FPMD (flow/MeanFlow) | Fast, expressive policies | Single-step Wasserstein bound |
| Classical PMD/AMPO | General parameterization | Linear convergence under assumptions |

