Stochastic MeanFlow Policies (SMFP)

Updated 4 July 2026

Stochastic MeanFlow Policies are one-step generative control policies that map Gaussian noise to multimodal action distributions via a learned MeanFlow transformation.
They cut action sampling from many network evaluations to one, and are trained with techniques such as entropic mirror descent and dispersive regularization.
Applications span robotic manipulation, continuous control, multi-agent RL, and offline RL, demonstrating reduced inference latency and increased policy expressiveness.

Stochastic MeanFlow Policies (SMFP) are generative control policies that produce actions by sampling latent noise from a simple prior and transporting it to the action space through a learned MeanFlow transformation in one or a few long-range steps, rather than through iterative diffusion or finely discretized flow integration. In the most explicit usage, SMFP denotes a one-step MeanFlow policy class equipped with stochastic reparameterization and trained under an entropic mirror-descent objective for off-policy reinforcement learning (Wang et al., 20 May 2026). Related work uses adjacent names (such as Mean Velocity Policy, Mean Flow Policy, MeanFlow policy, DMPO, MFPO, SOM, and one-step MeanFlow policy) for closely related constructions in robotic manipulation, continuous control, offline RL, multi-agent RL, and RL post-training, even when the label “SMFP” itself is not used (Zhan et al., 14 Feb 2026).

1. Definition, scope, and terminology

The central object in SMFP is a state-conditioned transport map from a simple base distribution, typically Gaussian, to an action distribution. Stochasticity does not usually arise from an iterative reverse-time SDE, but from explicit sampling of a latent noise variable and, in some variants, from additional Gaussian perturbations or stochastic transition kernels. This distinguishes SMFP from deterministic one-step decoders and from multi-step diffusion or flow-matching policies that require tens to hundreds of neural function evaluations (Wang et al., 20 May 2026).

Terminology is not uniform across the literature. Some papers explicitly define “Stochastic MeanFlow Policies” as one-step generative policies that map Gaussian noise to actions through a MeanFlow transformation (Wang et al., 20 May 2026). Others state that the term does not explicitly appear in their paper and instead use “MeanFlow policies,” “Mean Flow Policy Optimization,” or task-specific names such as MVP, DMPO, or VGM$^2$P, while still describing the same broad design principle: a stochastic pushforward policy realized by a learned average-velocity field over a probability path (Dong et al., 16 Apr 2026).

A recurrent source of confusion is the relation between SMFP and normalizing flows. MeanFlow policies are generally not trained as invertible likelihood models with tractable Jacobian determinants. Several formulations instead use regression on the MeanFlow identity, boundary constraints, score-derived targets, or value-guided objectives, and therefore trade exact likelihood tractability for fast one-step or few-step sampling and high policy expressiveness (Zhan et al., 14 Feb 2026).

2. Mathematical formulation and one-step generation

MeanFlow replaces the instantaneous velocity field of continuous-time flow matching with an average velocity over an interval. In one formulation, the mean velocity field is defined as

$u(a(t), t, r, s) \triangleq \frac{1}{r-t}\int_t^r v(a(\tau), \tau, s)\,\mathrm{d}\tau,$

so that one-step action generation becomes a single displacement from base noise to action (Zhan et al., 14 Feb 2026). Closely related papers use the same object with opposite time orientation, yielding equivalent one-step rules up to sign convention. As a result, the action may appear as either

$a = z + u_\theta(z,0,1,s)$

or

$a = z_1 - u_\theta(z_1,0,1,o),$

depending on whether the path is written from prior-to-data or data-to-prior (Zhan et al., 14 Feb 2026).

This formulation preserves the reparameterized-policy view. A latent variable, usually $z \sim \mathcal{N}(0,I)$, is sampled once, and the network deterministically maps $(s,z)$ to an action. The induced policy is therefore the pushforward of a simple base measure through a nonlinear, state-conditioned map. Because the map need not be injective or invertible, the resulting action distribution can be highly non-Gaussian and multimodal, which is one of the principal motivations for using MeanFlow policies instead of unimodal Gaussian actors (Zou et al., 28 Jan 2026).
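The pushforward view can be illustrated with a minimal sketch. The function `u_theta` below is a hand-written stand-in for a trained average-velocity network, not a learned model, and it is chosen (as an assumption for illustration) so that the one-step rule $a = z + u_\theta(z,0,1,s)$ sends Gaussian noise toward two modes near $-s$ and $+s$.

```python
import math
import random

def u_theta(z, t, r, s):
    """Hypothetical stand-in for a trained average-velocity network (1-D).

    Chosen by hand so that z + u_theta(z, 0, 1, s) = s * tanh(5 z),
    which pushes Gaussian noise toward the two modes -s and +s.
    """
    return s * math.tanh(5.0 * z) - z

def sample_action(s, rng):
    z = rng.gauss(0.0, 1.0)               # latent noise, sampled once
    return z + u_theta(z, 0.0, 1.0, s)    # a single evaluation of u_theta

rng = random.Random(0)
actions = [sample_action(2.0, rng) for _ in range(10_000)]

# Fraction of samples close to each mode, and close to the origin.
near_pos = sum(abs(a - 2.0) < 0.2 for a in actions) / len(actions)
near_neg = sum(abs(a + 2.0) < 0.2 for a in actions) / len(actions)
near_zero = sum(abs(a) < 0.2 for a in actions) / len(actions)
```

Most of the mass lands near the two modes and very little near the origin, which a single Gaussian actor centered between the modes could not reproduce.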

The MeanFlow identity links average and instantaneous velocities. For a linear interpolation path, one representative form is

$u(a(t),t,r,s) - (r-t)\frac{\mathrm{d}}{\mathrm{d}t}u(a(t),t,r,s) = v(a(t),t,s),$

with the total derivative implemented by a Jacobian-vector product. This identity is the basis for most MeanFlow training objectives, because it permits supervision of interval-averaged transport without numerically integrating the underlying ODE at inference time (Zhan et al., 14 Feb 2026).
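The identity can be checked on a toy field where everything is available in closed form. The sketch below assumes a one-dimensional velocity $v(a,t) = -a$ with no state input, for which the exact average velocity is $a\,(e^{-(r-t)} - 1)/(r-t)$. The total derivative is written out by hand as the product of the partial derivatives of $u$ with the tangent $(v, 1, 0)$, which is what a Jacobian-vector product would compute for a network.

```python
import math

def v(a, t):
    """Toy instantaneous velocity (assumption): da/dt = -a, no state input."""
    return -a

def g(d):
    return (math.exp(-d) - 1.0) / d

def dg(d):
    """Derivative of g with respect to the gap d = r - t."""
    return (-d * math.exp(-d) - (math.exp(-d) - 1.0)) / (d * d)

def u(a, t, r):
    """Exact average velocity for this toy field: (a(r) - a(t)) / (r - t)."""
    return a * g(r - t)

def total_derivative_u(a, t, r):
    """d/dt u(a(t), t, r): a hand-written JVP with tangent (v, 1, 0)."""
    du_da = g(r - t)
    du_dt = -a * dg(r - t)
    return du_da * v(a, t) + du_dt

def identity_residual(a, t, r):
    return u(a, t, r) - (r - t) * total_derivative_u(a, t, r) - v(a, t)

residuals = [abs(identity_residual(a, t, r))
             for a in (-1.5, 0.3, 2.0)
             for (t, r) in ((0.0, 1.0), (0.2, 0.7), (0.5, 0.9))]
```

The residuals vanish up to floating-point error, so for this field the left-hand side of the identity reproduces $v$ without integrating the ODE.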

The main algorithmic consequence is that inference complexity collapses from $O(T)$ network evaluations for a $T$-step flow or diffusion sampler to $O(1)$ for a one-step MeanFlow policy. The one-step claim is therefore not merely descriptive; it is a statement about the sampling operator itself (Zhan et al., 14 Feb 2026).
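The difference in evaluation counts can be made concrete with a small sketch. It assumes the toy field $v(a,t) = -a$ and its exact average velocity as stand-ins for trained networks, and the `CountingField` wrapper is a hypothetical helper that only counts calls.

```python
import math

class CountingField:
    """Wraps a function and counts how often it is called."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)

# Toy fields (assumptions): v(a, t) = -a and its exact average velocity.
v = CountingField(lambda a, t: -a)
u = CountingField(lambda a, t, r: a * (math.exp(-(r - t)) - 1.0) / (r - t))

def euler_sample(z, num_steps):
    """Multi-step sampler: one evaluation of v per Euler step."""
    a, dt = z, 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * v(a, k * dt)
    return a

def one_step_sample(z):
    """One-step sampler: a single evaluation of the average velocity."""
    return z + u(z, 0.0, 1.0)

z = 1.0
a_euler = euler_sample(z, num_steps=50)
a_one = one_step_sample(z)
exact = z * math.exp(-1.0)
```

Here the Euler sampler spends 50 evaluations and still carries discretization error, while the one-step rule uses one evaluation. The one-step result is exact only because this toy average velocity is known in closed form; a learned field would carry approximation error instead.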

3. Training objectives, constraints, and regularization

Most SMFP variants train the MeanFlow field by regressing to a target implied by the MeanFlow identity, usually with stop-gradient on the target branch. In MVP, the policy objective adds to this regression loss an instantaneous velocity constraint (IVC) that enforces the boundary condition at zero interval length,

$u(a(t), t, t, s) = v(a(t), t, s),$

which is the $r \to t$ limit of the definition of $u$ (Zhan et al., 14 Feb 2026). The role of this term is not merely empirical. The paper shows that, without an explicit boundary condition, the mean-flow identity admits a family of solutions with a free integration constant, so minimizing only the identity residual leaves a persistent ambiguity. IVC removes this degree of freedom and makes the mean-flow estimate well posed (Zhan et al., 14 Feb 2026).
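The free integration constant can be seen on a toy field. In the sketch below, which assumes $v(a,t) = -a$ and uses central finite differences in place of a Jacobian-vector product, adding a spurious term $c/(r-t)$ to the true average velocity leaves the identity residual at zero, because $c/(r-t) - (r-t)\,\frac{\mathrm{d}}{\mathrm{d}t}\big[c/(r-t)\big] = 0$. Only a check at small gap separates the two candidates.

```python
import math

def v(a, t):
    return -a  # toy instantaneous velocity (assumption)

def u_true(a, t, r):
    d = r - t
    return a * (math.exp(-d) - 1.0) / d

def u_shifted(a, t, r, c):
    """True average velocity plus a spurious c / (r - t) term."""
    return u_true(a, t, r) + c / (r - t)

def residual(u_fn, a, t, r, h=1e-5):
    """Identity residual; d/dt u is taken along (v, 1, 0) by central differences."""
    du = (u_fn(a + h * v(a, t), t + h, r)
          - u_fn(a - h * v(a, t), t - h, r)) / (2 * h)
    return u_fn(a, t, r) - (r - t) * du - v(a, t)

a, t, c = 0.8, 0.3, 5.0
res_true = residual(u_true, a, t, 0.9)
res_shifted = residual(lambda a_, t_, r_: u_shifted(a_, t_, r_, c), a, t, 0.9)

# Boundary check at a tiny gap: the true field approaches v, the shifted one blows up.
gap = 1e-3
err_true = abs(u_true(a, t, t + gap) - v(a, t))
err_shifted = abs(u_shifted(a, t, t + gap, c) - v(a, t))
```

Both candidates pass the residual test, but only the true field stays close to $v$ as the gap shrinks, which is the kind of ambiguity a boundary constraint is meant to remove.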

A second training theme is representation collapse. One-step policies have no iterative correction stage, so poor internal representations translate directly into poor actions. DMPO and DM1 address this with dispersive regularization on conditional, temporal, or noise embeddings. DMPO studies InfoNCE-L2, InfoNCE-Cos, hinge, and covariance regularization, and reports that dispersion stabilizes one-step MeanFlow pre-training and reduces variance on harder tasks (Zou et al., 28 Jan 2026). DM1 applies dispersive losses to multiple intermediate embeddings and attributes improved manipulation performance to the prevention of representation collapse, especially when distinct observations would otherwise map to overly similar latent states (Zou et al., 9 Oct 2025).
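A minimal sketch of one dispersive penalty follows. It implements an InfoNCE-style loss with squared L2 distance, the logarithm of the mean over all pairs of $\exp(-\lVert z_i - z_j\rVert^2/\tau)$. Whether DMPO or DM1 use exactly this normalization, temperature, or choice of embedding layer is an assumption here, and the embeddings are made-up values.

```python
import math

def dispersive_loss_infonce_l2(embeddings, temperature=1.0):
    """log of the mean over all pairs (i, j) of exp(-||z_i - z_j||^2 / temperature)."""
    n = len(embeddings)
    total = 0.0
    for zi in embeddings:
        for zj in embeddings:
            sq_dist = sum((x - y) ** 2 for x, y in zip(zi, zj))
            total += math.exp(-sq_dist / temperature)
    return math.log(total / (n * n))

# Illustrative embeddings (assumptions): a collapsed batch and a spread-out batch.
collapsed = [[0.5, 0.5]] * 4
spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

loss_collapsed = dispersive_loss_infonce_l2(collapsed)
loss_spread = dispersive_loss_infonce_l2(spread)
```

The collapsed batch attains the maximum value of zero, while the spread batch scores lower, so minimizing this term pushes embeddings apart without needing positive pairs.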

A third line of work focuses on the stochastic structure of the actor itself. The paper explicitly titled “Stochastic MeanFlow Policies” augments the residual MeanFlow map with a diagonal Gaussian noise head, schematically

$a = z + u_\theta(z,0,1,s) + \sigma_\theta \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0,I),$

where $\sigma_\theta$ is a predicted per-dimension scale. This yields a tractable conditional entropy surrogate and supports entropy-regularized mirror descent without requiring the marginal density of the implicit policy (Wang et al., 20 May 2026). In that formulation, the entropy regularizer is implemented as a hinge floor on the average predicted log-scale, preventing variance collapse while avoiding an incentive for unbounded entropy growth (Wang et al., 20 May 2026).
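The stochastic head and the hinge floor can be sketched as follows. Everything concrete here is an assumption for illustration: `mean_map` and `log_scale_head` are hand-written stand-ins for network outputs, the dependence of the log-scale on $(s, z)$ is a guess, and the floor values are arbitrary. Only the overall shape (a mean map plus diagonal Gaussian noise, with a hinge on the average log-scale) follows the description above.

```python
import math
import random

def mean_map(z, s):
    """Hypothetical residual MeanFlow output z + u_theta(z, 0, 1, s), per dimension."""
    return [s * math.tanh(zi) for zi in z]

def log_scale_head(z, s):
    """Hypothetical noise head: one predicted log-scale per action dimension."""
    return [-1.0 - 0.1 * abs(zi) for zi in z]

def sample_action(s, dim, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    mu, log_sigma = mean_map(z, s), log_scale_head(z, s)
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    action = [m + math.exp(ls) * e for m, ls, e in zip(mu, log_sigma, eps)]
    return action, log_sigma

def gaussian_conditional_entropy(log_sigma):
    """Entropy of a diagonal Gaussian with the given log standard deviations."""
    return sum(log_sigma) + 0.5 * len(log_sigma) * math.log(2.0 * math.pi * math.e)

def hinge_floor_penalty(log_sigmas, floor):
    """Zero when the batch-average log-scale is above `floor`, linear below it."""
    flat = [ls for row in log_sigmas for ls in row]
    return max(0.0, floor - sum(flat) / len(flat))

rng = random.Random(0)
batch = [sample_action(1.0, 3, rng) for _ in range(256)]
log_sigmas = [ls for _, ls in batch]
penalty_active = hinge_floor_penalty(log_sigmas, floor=-0.5)
penalty_inactive = hinge_floor_penalty(log_sigmas, floor=-2.0)
```

The penalty is nonzero only while the average log-scale sits below the floor, so it resists variance collapse but exerts no pressure once the floor is cleared, consistent with the stated aim of avoiding unbounded entropy growth.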

Training dynamics themselves are also nontrivial. A separate analysis of MeanFlow training finds that well-established instantaneous velocity is a prerequisite for learning average velocity, that small-gap average-velocity supervision can help instantaneous-velocity learning, and that large-gap supervision should be emphasized only after accurate instantaneous and small-gap average velocities have formed (Kim et al., 24 Nov 2025). This suggests that apparently “one-step” SMFP often depend on a carefully staged training curriculum even when inference is strictly one-shot.

4. Reinforcement-learning formulations and representative variants

SMFP have been embedded in several distinct RL update rules. The variants differ less in the transport map than in how they define the target distribution, how they estimate likelihood or entropy, and how they couple policy improvement to critics or demonstrations.

Variant	Core mechanism	Setting
MVP	Mean-flow regression with IVC and best-of-$N$ critic selection	Offline-to-online robotic manipulation
DMPO	One-step MeanFlow, dispersive regularization, PPO fine-tuning	Robotics and locomotion
MFPO	Few-step MeanFlow under MaxEnt RL with average divergence network	Online continuous control
SOM	Score-based target velocity from $Q$ via probability-flow ODE	Fully online RL
SMFP	Entropic mirror descent with diagonal Gaussian stochastic head	Off-policy online RL
VGM$^2$P / LPS	Value-guided conditional MeanFlow or latent steering	Offline MARL / offline RL

MVP uses generate-and-select policy improvement: it samples multiple one-step candidate actions and chooses the highest-$Q$ action, using the same mechanism for acting and for critic targets (Zhan et al., 14 Feb 2026). DMPO keeps one-step sampling but introduces two additional ingredients: dispersive regularization in stage-1 pre-training and PPO-based fine-tuning through an explicit Gaussian Markov chain in stage 2, which makes pathwise log-probabilities tractable even though the final marginal action density is not (Zou et al., 28 Jan 2026).
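Generate-and-select improvement is simple to sketch. The sampler and critic below are hypothetical toy functions, not MVP's networks, and the candidate count of 8 is an arbitrary choice. The point is only the mechanism: draw several one-step candidates and keep the one the critic scores highest.

```python
import math
import random

def one_step_policy(s, rng):
    """Hypothetical one-step sampler: bimodal actions near -s and +s."""
    z = rng.gauss(0.0, 1.0)
    return s * math.tanh(5.0 * z)

def q_value(s, a):
    """Hypothetical critic that prefers actions near +s."""
    return -(a - s) ** 2

def best_of_n(s, n, rng):
    """Sample n one-step candidates and return the one with the highest critic value."""
    candidates = [one_step_policy(s, rng) for _ in range(n)]
    return max(candidates, key=lambda a: q_value(s, a))

rng = random.Random(0)
s = 1.0
single = [one_step_policy(s, rng) for _ in range(2000)]
selected = [best_of_n(s, 8, rng) for _ in range(2000)]
mean_q_single = sum(q_value(s, a) for a in single) / len(single)
mean_q_selected = sum(q_value(s, a) for a in selected) / len(selected)
```

Selection concentrates actions on the mode the critic prefers. The same property means that a miscalibrated critic would steer selection toward its own errors.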

MFPO places MeanFlow in a maximum-entropy actor-critic framework. Because soft policy iteration requires entropy and likelihood terms, MFPO introduces an average divergence network that approximates the time-integrated divergence of the velocity field, thereby making log-likelihood estimation practical for a few-step MeanFlow sampler (Dong et al., 16 Apr 2026). By contrast, SOM avoids target-distribution samples altogether: it constructs the target velocity directly from the critic through a probability-flow ODE whose score term is estimated from a Boltzmann action distribution induced by $Q$ (Kim et al., 22 May 2026).
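The role of an integrated divergence in likelihood estimation can be checked on a flow where it is known exactly. The sketch assumes a one-dimensional contraction $\mathrm{d}a/\mathrm{d}t = -Ka$, whose divergence is the constant $-K$, and applies the standard change-of-variables rule for continuous flows, $\log p_1(a) = \log p_0(z) - \int_0^1 \nabla\!\cdot v\,\mathrm{d}\tau$. In MFPO the averaged divergence is learned by a network; here it is simply returned in closed form.

```python
import math

K = 0.7  # contraction rate of the toy flow (assumption)

def log_standard_normal(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow(z):
    """Exact solution at time 1 of da/dt = -K * a started from z."""
    return z * math.exp(-K)

def average_divergence(t, r):
    """(1 / (r - t)) * integral of div v over [t, r]; div v = -K for this flow."""
    return -K

def log_prob_via_divergence(z):
    """Change of variables: log p1(a) = log p0(z) - (1 - 0) * average divergence."""
    return log_standard_normal(z) - (1.0 - 0.0) * average_divergence(0.0, 1.0)

def log_prob_closed_form(a):
    """Density of a = z * exp(-K) with z ~ N(0, 1), i.e. N(0, exp(-2K))."""
    sigma = math.exp(-K)
    return -0.5 * (a / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

gaps = [abs(log_prob_via_divergence(z) - log_prob_closed_form(flow(z)))
        for z in (-2.0, -0.3, 0.0, 1.1)]
```

The two log-densities agree, which shows why an accurate estimate of the interval-averaged divergence is enough to recover log-likelihoods along a few long steps.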

The explicitly named SMFP formulation in online off-policy RL combines SAC-style exploration with mirror descent. Its actor loss has three pieces: a reparameterized $Q$ term, a conditional-entropy surrogate, and an advantage-weighted MeanFlow regression term that realizes the mirror-descent projection without requiring KL computations between implicit marginals (Wang et al., 20 May 2026). This is one of the clearest formalizations of SMFP as a stochastic policy class rather than merely a stochastic sampling interpretation.

Beyond single-agent continuous control, MeanFlow policies have also been adapted to offline latent policy steering and multi-agent value-guided behavior cloning. LPS backpropagates action-space $Q$-gradients through a differentiable one-step MeanFlow prior, eliminating proxy latent critics (Im et al., 5 Mar 2026). VGM$^2$P uses conditional MeanFlow policies with classifier-free guidance and value-derived binary conditions under centralized training and decentralized execution, treating optimal joint policy learning as conditional behavior cloning (Pang et al., 9 Apr 2026).

5. Empirical characteristics, efficiency, and domains of application

A defining empirical pattern across the literature is that SMFP-like policies approach the expressivity of generative policies while retaining single-step or few-step latency. In robotic manipulation, MVP reports the highest average success rate across nine tasks, and it benchmarks average inference time against BFN and QC and average online training speed (iterations/s) against FQL, QC, and BFN (Zhan et al., 14 Feb 2026). The same paper reports that naive one-step replacements for multi-step flow baselines fail on the hardest tasks, which indicates that one-step generation is effective only when paired with MeanFlow training and IVC rather than as a post hoc truncation (Zhan et al., 14 Feb 2026).

DMPO emphasizes real-time deployment. It reports one-step forward-pass times and the resulting control frequency on an NVIDIA RTX 4090, together with end-to-end one-step latency on an RTX 2080 in a Franka-Emika-Panda control loop (Zou et al., 28 Jan 2026). MFPO, although few-step rather than strictly one-step, reports its inference latency on MuJoCo and lower wall-clock training time than diffusion-based baselines while matching or exceeding their returns on MuJoCo and DeepMind Control Suite benchmarks (Dong et al., 16 Apr 2026).

In fully online RL, SOM reports the lowest training and inference latency among the listed baselines on HalfCheetah-v4 and achieves state-of-the-art results on four of five locomotion tasks (Kim et al., 22 May 2026). The explicit SMFP formulation under entropic mirror descent reports its inference latency on Ant-v4 and the best average rank across seven MuJoCo tasks while maintaining one-step generation (Wang et al., 20 May 2026).

Robotic-manipulation variants reinforce the same pattern. MP1 reports a higher average success rate than both DP3 and FlowPolicy, alongside its average inference time (Sheng et al., 14 Jul 2025). DM1 reports faster inference than multi-step baselines in simulation and compares its Lift success rate with that of the baseline, while retaining one-step MeanFlow generation (Zou et al., 9 Oct 2025). These results suggest that SMFP-like constructions are especially attractive where multimodal action structure and strict control-frequency constraints coexist.

6. Limitations, misconceptions, and open directions

Several limitations recur across the literature. First, exact likelihood and entropy remain difficult for implicit one-step policies. MFPO addresses this with an average divergence network, SMFP with a conditional Gaussian entropy lower bound, and DMPO with path-level Gaussian-chain likelihoods during PPO fine-tuning, but there is no single generally accepted solution (Dong et al., 16 Apr 2026). A common misconception is therefore that all MeanFlow policies are as tractable as Gaussian SAC policies; the cited work shows that tractability depends on additional architectural or variational machinery rather than on MeanFlow alone.

Second, one-step generation does not automatically dominate few-step methods. MFPO explicitly uses a few-step sampler and states that pushing to one step is future work, while several papers note that highly curved transport fields, tightly constrained action spaces, or delicate manifold structure may still favor a small number of refinement steps (Dong et al., 16 Apr 2026). This suggests that “one-step is enough” is best understood as a strong empirical claim for certain domains, not as a universal theorem.

Third, critic quality remains a bottleneck. SOM depends directly on $Q$-gradients to construct the target velocity; MFPO relies on accurate Boltzmann targets and effective self-normalized importance sampling; generate-and-select methods such as MVP can overconcentrate on critic errors if $Q$ is miscalibrated (Kim et al., 22 May 2026). Theoretical analyses typically provide local or lower-bound guarantees under bounded critic error, Lipschitz assumptions, or effective sample-size conditions, not global convergence guarantees for the full actor-critic system (Zhan et al., 14 Feb 2026).
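The effective sample-size condition can be made tangible with a short sketch. It assumes Boltzmann-style weights proportional to $\exp(Q/\alpha)$ over a handful of made-up critic values and uses the standard Kish effective sample size; the temperatures are arbitrary, and none of this reproduces MFPO's actual estimator.

```python
import math

def self_normalized_weights(q_values, temperature):
    """Weights proportional to exp(Q / temperature), normalized to sum to 1."""
    m = max(q_values)  # subtract the max for numerical stability
    unnorm = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(unnorm)
    return [w / total for w in unnorm]

def effective_sample_size(weights):
    """Kish effective sample size for normalized weights: 1 / sum(w_i^2)."""
    return 1.0 / sum(w * w for w in weights)

q_values = [1.0, 0.9, 1.1, 0.2, 3.0]  # made-up critic values for five candidates
ess_warm = effective_sample_size(self_normalized_weights(q_values, temperature=10.0))
ess_cold = effective_sample_size(self_normalized_weights(q_values, temperature=0.1))
```

At high temperature nearly all five samples contribute, while at low temperature the estimate rests on a single sample, so one overestimated critic value can dominate the target.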

Fourth, MeanFlow training itself can be memory-intensive because of Jacobian-vector products and derivative terms. Several papers identify JVP overhead, schedule sensitivity, or instability from large-gap supervision as practical obstacles (Zhan et al., 14 Feb 2026). OMP replaces JVP with a finite-difference “Differential Derivation Equation” approximation to reduce memory, at a small performance cost, while a broader MeanFlow training study recommends curricula that move progressively from instantaneous and small-gap supervision toward large-gap supervision (Fang et al., 22 Dec 2025).
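The trade-off behind finite-difference replacements for the JVP can be sketched on a toy field. The code assumes $v(a,t) = -a$ with its closed-form average velocity and compares a plain forward difference along the tangent $(v, 1, 0)$ with the hand-derived total derivative. It is a generic finite-difference illustration and does not claim to reproduce OMP's specific approximation.

```python
import math

def v(a, t):
    return -a  # toy instantaneous velocity (assumption)

def u(a, t, r):
    d = r - t
    return a * (math.exp(-d) - 1.0) / d

def total_derivative_exact(a, t, r):
    """Hand-derived d/dt u(a(t), t, r) for this toy field (what a JVP would return)."""
    d = r - t
    g = (math.exp(-d) - 1.0) / d
    dg = (-d * math.exp(-d) - (math.exp(-d) - 1.0)) / (d * d)
    return g * v(a, t) - a * dg

def total_derivative_fd(a, t, r, h):
    """Forward difference along the tangent (v, 1, 0): two evaluations of u, no JVP."""
    return (u(a + h * v(a, t), t + h, r) - u(a, t, r)) / h

a, t, r = 1.2, 0.1, 0.8
exact = total_derivative_exact(a, t, r)
errors = {h: abs(total_derivative_fd(a, t, r, h) - exact) for h in (1e-1, 1e-2, 1e-3)}
```

The error shrinks roughly in proportion to the step size, which is the accuracy given up in exchange for avoiding derivative propagation through the network.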

A final direction concerns stochasticization of originally deterministic few-step flow maps. Flow-Map GRPO proposes Anchored Stochastic Flow Map Composition (ASFMC), which introduces anchor-based conditional resampling while preserving the original marginal probability path, thereby enabling likelihood ratios and RL post-training for deterministic MeanFlow-like generators (Li et al., 1 Jul 2026). This suggests a broader future for SMFP beyond continuous-control actors: stochastic MeanFlow mechanisms need not be native to the original parameterization, but can also be imposed post hoc if the resulting kernels preserve the underlying probability path.

Taken together, the literature presents SMFP as a family rather than a single algorithm: one-step or few-step MeanFlow transport maps driven by explicit stochastic reparameterization, critic-guided or demonstration-guided improvement, and auxiliary devices for uniqueness, representation stability, entropy control, or post-hoc stochasticization. The unifying idea is that average-velocity transport can preserve the expressive advantages of generative policies while reducing action generation to a single large transport step or a very small number of such steps.