Mean Flow Policy Optimization

Published 16 Apr 2026 in cs.LG | (2604.14698v1)

Abstract: Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo and DeepMind Control Suite benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/MFPolicy/MFPO.


Authors (3)

Summary

The paper introduces MFPO, which leverages a learned mean velocity field to reduce integration steps from 16–20 to just 2, lowering training and inference cost.
The paper integrates an average divergence network and adaptive self-normalized importance sampling for accurate likelihood estimation and stable policy improvement.
The paper reports performance comparable to or exceeding diffusion-based baselines on MuJoCo and DeepMind Control Suite, with considerably reduced training and inference times.

Mean Flow Policy Optimization: Accelerating Expressive Policy Learning in MaxEnt RL

Introduction

Mean Flow Policy Optimization (MFPO) targets continuous-action reinforcement learning (RL), using the recent MeanFlow generative model formalism to avoid the computational bottlenecks of diffusion-based expressive policies. The work builds a maximum entropy (MaxEnt) RL method around few-step MeanFlow sampling, introduces estimators for action likelihood and policy improvement, and reports strong empirical and computational performance on high-dimensional continuous control benchmarks.

Background and Motivation

Standard RL approaches in continuous control predominantly utilize unimodal policy classes—either deterministic mappings or Gaussian parameterizations. While such parameterizations are theoretically capable of representing optimal policies, they exhibit poor sample efficiency in environments characterized by multi-modal reward landscapes, often resulting in policy collapse around local optima. Recent efforts have thus adopted diffusion and flow models as policy classes due to their capacity for supporting highly multi-modal, expressive distributions. However, these advances have led to an unfavorable trade-off: high-quality exploration comes at the cost of slow, iterative sample generation, as diffusion models typically require 10–20 step ODE integration per policy evaluation and gradient update.

The MeanFlow methodology modifies the underlying objective by directly learning the average velocity field of the sample transport ODE, reducing the error induced by coarse discretization and enabling accurate generative modeling with as few as two sampling steps. The formal apparatus underlying MeanFlow models thus strikes an improved balance between expressivity and computational cost, motivating its adoption as a policy class in online RL.
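The effect of learning an average rather than an instantaneous velocity can be seen in a toy setting. The sketch below is not from the paper: it assumes a hand-picked ODE, dz/dt = z, integrated from t = 1 down to t = 0, for which the average velocity has a closed form. The function names are illustrative, and in MFPO the average velocity would be a learned network rather than a formula.

```python
import numpy as np

def instantaneous_velocity(z, t):
    # Toy velocity field v(z, t) = z (an assumption made for illustration).
    return z

def average_velocity(z_t, r, t):
    # Closed-form average of v along the exact trajectory between r and t:
    # (z_t - z_r) / (t - r) with z_r = z_t * exp(-(t - r)).
    return z_t * (1.0 - np.exp(-(t - r))) / (t - r)

def sample(z1, num_steps, use_average):
    # Step from t = 1 (noise side) to t = 0 on a uniform time grid.
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = np.asarray(z1, dtype=float)
    for t, r in zip(times[:-1], times[1:]):
        if use_average:
            vel = average_velocity(z, r, t)
        else:
            vel = instantaneous_velocity(z, t)
        z = z - (t - r) * vel
    return z

z1 = np.array([1.0, -2.0])
exact = z1 * np.exp(-1.0)                   # true solution at t = 0
euler_2 = sample(z1, 2, use_average=False)  # two Euler steps on v
mean_2 = sample(z1, 2, use_average=True)    # two steps on the average velocity

print("exact :", exact)
print("euler :", euler_2)
print("mean  :", mean_2)
```

With two steps, stepping along the exact average velocity recovers the true endpoint, while Euler integration of the instantaneous velocity carries a visible discretization error. A learned average velocity would only approximate this behaviour.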

Formulation: MeanFlow Policies in the MaxEnt RL Framework

The MFPO method represents policies as solutions to an ODE parameterized by the average velocity field $\boldsymbol{u}_\theta$, enabling a generative mapping from prior noise to actions via a few-step integration (typically $T=2$). The model operates under the MaxEnt RL objective, incorporating both external reward and policy entropy into the optimization, and is optimized by soft policy iteration.
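For orientation, the two ingredients can be written in standard notation. This is a sketch under assumptions rather than the paper's exact formulation: the discount $\gamma$, temperature $\alpha$, instantaneous velocity $\boldsymbol{v}$, flow state $z_t$, and the conditioning on the state $s$ are symbols introduced here, and the flow is assumed to run from noise at $t=1$ to an action at $t=0$.

$$
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{k \ge 0} \gamma^{k}\big(r(s_k, a_k) + \alpha\,\mathcal{H}(\pi(\cdot \mid s_k))\big)\Big]
$$

$$
\boldsymbol{u}(z_t, r, t) = \frac{1}{t-r}\int_r^t \boldsymbol{v}(z_\tau, \tau)\,d\tau,
\qquad
z_r = z_t - (t-r)\,\boldsymbol{u}_\theta(z_t, r, t \mid s)
$$

Under these assumptions and a uniform time grid, $T=2$ corresponds to two applications of the update on the right, $1 \to \tfrac{1}{2} \to 0$, with the final $z_0$ taken as the action.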

Key to this integration are:

Likelihood Approximation for Entropy and Policy Improvement: The change-of-variables theorem requires the log-determinant of the Jacobian, or equivalently, the integral over the divergence of the velocity field. However, this integral is intractable for non-trivial velocity parameterizations. MFPO proposes learning an average divergence network $\delta_\omega$ to efficiently approximate this average divergence via unbiased trace estimation (Skilling–Hutchinson) and automatic differentiation. This network yields high-fidelity likelihood estimates for use in entropy regularization and importance weight computation.
Policy Improvement via Adaptive SNIS: In the policy improvement step, the maximization over the action space is implemented by projecting the Boltzmann policy induced by the current $Q$-function onto the MeanFlow policy. Since samples from the target Boltzmann distribution are not directly accessible, the method uses adaptive self-normalized importance sampling (SNIS), mixing two proposal distributions: a state-conditional Gaussian and the current policy itself, with adaptive weights assigned according to effective sample size (ESS). This estimator both mitigates estimator variance and efficiently exploits regions compatible with the learned policy and the $Q$-function.
Figure 1: Normalized ESS and estimator variance for SNIS-based velocity estimation; policy-based proposals maintain high efficiency for large $t$.
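The SNIS and ESS quantities above can be illustrated on a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the paper's estimator: the quadratic Q-function, the temperature, both proposal distributions (the second is a Gaussian stand-in for the current policy), and the rule of mixing the two estimates in proportion to their normalized ESS are all chosen here for clarity. The paper estimates a velocity target, whereas this example estimates the mean action under the Boltzmann target.

```python
import numpy as np

def gaussian_logpdf(a, mean, std):
    # Log-density of a 1-D Gaussian.
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def snis(log_target_unnorm, log_proposal, values):
    # Self-normalized importance sampling estimate and normalized ESS.
    log_w = log_target_unnorm - log_proposal
    w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    w = w / w.sum()
    ess = 1.0 / np.sum(w ** 2)
    return np.sum(w * values), ess / len(w)

def q_value(a):
    # Hypothetical Q-function; with alpha = 0.5 the Boltzmann target is N(1, 0.5^2).
    return -(a - 1.0) ** 2

rng = np.random.default_rng(0)
alpha = 0.5
n = 4096

# Proposal 1: a broad Gaussian.
a_g = rng.normal(0.0, 2.0, size=n)
est_g, ess_g = snis(q_value(a_g) / alpha, gaussian_logpdf(a_g, 0.0, 2.0), a_g)

# Proposal 2: stand-in for the current policy, already close to the target.
a_p = rng.normal(0.8, 0.6, size=n)
est_p, ess_p = snis(q_value(a_p) / alpha, gaussian_logpdf(a_p, 0.8, 0.6), a_p)

# Assumed adaptive rule: weight each proposal's estimate by its normalized ESS.
mix = np.array([ess_g, ess_p])
mix = mix / mix.sum()
estimate = mix[0] * est_g + mix[1] * est_p

print("normalized ESS (gaussian, policy):", ess_g, ess_p)
print("estimated target mean:", estimate)
```

In this toy setup the proposal that sits closer to the target attains the higher normalized ESS and therefore receives more weight, which is the qualitative behaviour the adaptive scheme relies on.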

Experimental Analysis

Comparative Performance

MFPO is empirically evaluated on MuJoCo and DeepMind Control Suite locomotion benchmarks against recent diffusion- and flow-based RL methods (DIME, FlowRL, MaxEntDP, DACER, QVPO) and classical baselines (TD3, SAC). The main findings are:

Sample Efficiency and Optimality: MFPO matches or exceeds the final normalized returns of all baselines across tasks, with strong sample efficiency. Notably, it achieves these outcomes using only two ODE steps, in contrast to 16–20 steps required by competing diffusion-based methods.
Training and Inference Time: MFPO cuts per-step inference latency by $2\times$–$4\times$ relative to diffusion methods, nearly closing the gap with Gaussian policy approaches.
Figure 2: MFPO attains high normalized return with substantially lower training and inference time than diffusion and flow RL baselines.

Ablation and Hyperparameter Studies

Ablation experiments highlight the necessity of core MFPO components:

MeanFlow Objective vs. Standard Flow Matching: Replacing the average velocity objective with instantaneous velocity field matching induces significant discretization error, degrading data efficiency and policy return, particularly when limited to a small number of integration steps.
Average Divergence Network: Disabling the divergence estimator results in collapsed entropy estimates and unstable updates.
Adaptive SNIS: Combining Gaussian and policy proposals yields reduced estimator variance and improved policy updates, as confirmed by consistent performance across sampling ratio configurations.
Temperature Tuning: Adopting automatic temperature adjustment (SAC-style) ensures robust entropy control and stable convergence across environments and reward scales.
Figure 3: Ablations on HalfCheetah-v3 demonstrate effects of various velocity modeling, divergence estimation, proposal mixing ratios, and entropy targets.
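The divergence estimator examined in the ablation builds on Skilling–Hutchinson trace estimation, which can be sketched independently of the network. The example below is illustrative only: the velocity field `tanh(W z)` and the matrix `W` are hypothetical, and Jacobian-vector products are approximated with central finite differences to keep the code NumPy-only, whereas the source describes automatic differentiation. The average divergence network itself, which would be trained on such estimates, is not modeled.

```python
import numpy as np

# Hypothetical weights defining a small nonlinear velocity field v(z) = tanh(W z).
W = np.array([[0.5, -0.3, 0.2],
              [0.1, 0.4, -0.6],
              [-0.2, 0.3, 0.7]])

def velocity(z):
    # Accepts a single point of shape (d,) or a batch of shape (n, d).
    return np.tanh(z @ W.T)

def exact_divergence(z):
    # Trace of the Jacobian diag(1 - tanh^2(W z)) @ W, available in closed form here.
    pre = W @ z
    return np.sum((1.0 - np.tanh(pre) ** 2) * np.diag(W))

def hutchinson_divergence(z, num_probes, rng, h=1e-4):
    # Rademacher probes eps; E[eps^T J eps] equals trace(J).
    eps = rng.choice([-1.0, 1.0], size=(num_probes, z.shape[0]))
    # Central finite-difference approximation of the Jacobian-vector product J @ eps.
    jvp = (velocity(z + h * eps) - velocity(z - h * eps)) / (2.0 * h)
    return np.mean(np.sum(eps * jvp, axis=1))

rng = np.random.default_rng(0)
z = np.array([0.3, -0.1, 0.8])
estimate = hutchinson_divergence(z, num_probes=4000, rng=rng)
exact = exact_divergence(z)

print("Hutchinson estimate:", estimate)
print("exact divergence   :", exact)
```

The estimator needs only Jacobian-vector products, never the full Jacobian, which is what makes it practical for high-dimensional action spaces.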

Additional Analyses

Distributional Q-Learning (via C51) substantially improves performance and stability in the policy evaluation phase, compared with pointwise critics.
Test-Time Action Selection: Generating multiple policy samples and selecting the highest-$Q$ action further boosts final evaluation return.
Figure 4: Distributional Q-learning and multi-sample action selection each benefit policy improvement and stability.
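Test-time action selection amounts to a best-of-N search over policy samples. The sketch below uses placeholders throughout: `sample_actions` stands in for the few-step MeanFlow sampler and `q_function` for the learned critic. Both are hypothetical toy functions, not the paper's models, and the candidate count of 16 is arbitrary.

```python
import numpy as np

def sample_actions(state, num_candidates, rng):
    # Placeholder policy sampler: Gaussian noise around a state-dependent mean.
    return state[:2] + rng.normal(size=(num_candidates, 2))

def q_function(state, actions):
    # Placeholder critic: prefers actions near a fixed (hypothetical) target.
    target = np.array([0.5, -0.5])
    return -np.sum((actions - target) ** 2, axis=1)

def select_best_action(state, num_candidates, rng):
    # Draw several candidate actions and keep the one with the highest Q-value.
    candidates = sample_actions(state, num_candidates, rng)
    q_values = q_function(state, candidates)
    best = int(np.argmax(q_values))
    return candidates[best], q_values[best], q_values

rng = np.random.default_rng(0)
state = np.zeros(4)
action, best_q, all_q = select_best_action(state, 16, rng)

print("selected action:", action)
print("its Q-value    :", best_q)
```

The extra cost grows linearly with the number of candidates, so the approach is most attractive when each sample is cheap, as with a two-step sampler.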

Practical and Theoretical Implications

MFPO demonstrates that mean-velocity flow models, made compatible with the MaxEnt RL paradigm via learned divergence approximators and robust policy improvement estimators, can reconcile expressive policy classes with real-time training and deployment constraints. This result challenges the assumption that expressive policy learning necessitates many-step ODE integration.

Theoretically, MFPO illustrates the importance of aligning generative policy learning objectives (average-velocity field fitting) with RL training requirements (tractable likelihoods for entropy and importance weighting), opening pathways for extending expressive RL policy families to even lower-latency regimes.

Conclusion

MFPO combines mean-velocity-field flows with MaxEnt RL, addressing two obstacles for expressive, multi-modal policies: tractable entropy estimation and high-variance policy improvement. Its two estimators, one for likelihood and one for policy improvement, enable few-step action sampling in both policy evaluation and improvement without sacrificing return or sample efficiency.

While two integration steps currently remain necessary to preserve expressivity and performance, future directions include the design of single-step mean flow policy models, advanced proposal mechanisms, or leveraging shortcut consistency frameworks for further acceleration.
