MeanFlow: One-Step Generative Modeling

Updated 24 November 2025

MeanFlow is a generative modeling method that leverages time-averaged velocity fields to enable one-step or few-step data generation, reducing computational costs.
It introduces a training objective that regresses a neural approximation of the average velocity onto a target built from the instantaneous velocity through the MeanFlow identity, which keeps the model consistent with the underlying continuous-time flow.
The framework generalizes traditional diffusion and flow models, with reported state-of-the-art or competitive results in image synthesis, speech enhancement, and robotic applications.

MeanFlow (MF) is a generative modeling framework based on learning time-averaged velocity fields, designed to enable efficient one-step or few-step data generation with competitive sample quality relative to conventional multi-step diffusion and flow-based models. MeanFlow achieves this by directly parameterizing the average (mean) velocity that transports a sample across finite time intervals, rather than relying on the instantaneous velocity fields that necessitate costly ODE or SDE integration. This approach yields a generative model architecture that supports native one-step sampling, theoretical consistency with continuous-time generative mechanisms, and broad applicability across modalities, architectures, and downstream tasks.

1. Mathematical Foundations of the MeanFlow Framework

MeanFlow generalizes classic flow-matching and diffusion models, which describe generative processes via a time-indexed trajectory $z_t$ governed by an ordinary differential equation: $\frac{d z_t}{dt} = v(z_t, t), \quad z_0 \sim p_0, \; z_1 \sim p_1$, where $v(z_t, t)$ is a learnable instantaneous velocity field mapping the current state and time to a velocity vector. In the convention used throughout this article, $p_0$ is the data distribution and $p_1$ is the prior (noise) distribution. Conventional flow matching learns $v$ by minimizing a squared-error loss against a conditional velocity defined by a known analytic path between paired data and prior samples, and inference requires discretizing the ODE into many integration steps (e.g., Euler or Runge–Kutta).
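To make the cost of multi-step integration concrete, the following minimal sketch integrates a toy one-dimensional ODE backward from $t = 1$ to $t = 0$ with explicit Euler steps. The linear field $v(z, t) = a z$, the rate `a`, and all function names are assumptions chosen for illustration because the exact solution is known in closed form; none of them come from the MeanFlow papers.

```python
import math


def velocity(z, t, a=1.5):
    """Toy instantaneous velocity field v(z, t) = a * z (assumed for illustration)."""
    return a * z


def euler_sample(z1, num_steps, a=1.5):
    """Integrate dz/dt = v(z, t) backward from t = 1 to t = 0 with explicit Euler."""
    z, t = z1, 1.0
    h = 1.0 / num_steps
    for _ in range(num_steps):
        z = z - h * velocity(z, t, a)  # one velocity evaluation per step
        t -= h
    return z


def exact_sample(z1, a=1.5):
    """Closed-form solution of the toy ODE: z_0 = z_1 * exp(-a)."""
    return z1 * math.exp(-a)


if __name__ == "__main__":
    z1 = 2.0
    for n in (1, 4, 16, 256):
        approx = euler_sample(z1, n)
        print(n, approx, abs(approx - exact_sample(z1)))
```

In this toy setting the error shrinks only as the number of velocity evaluations grows, which is the cost that an average-velocity parameterization aims to remove.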

MeanFlow introduces the concept of the average velocity field $u(z_t, r, t)$ over an interval $[r, t]$, defined as: $u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau) d\tau$. This quantity captures the net displacement between two timepoints, normalized by the interval length. Crucially, MeanFlow leverages an identity connecting the average and instantaneous velocities: $u(z_t, r, t) = v(z_t, t) - (t - r) \frac{d}{dt} u(z_t, r, t)$, where the total derivative is

$\frac{d}{dt} u(z_t, r, t) = v(z_t, t) \cdot \nabla_z u(z_t, r, t) + \partial_t u(z_t, r, t)$
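The identity stated above follows in a few lines from the definition of $u$. The derivation below is a sketch that assumes $u$ is differentiable in $t$ and that $r$ is held fixed, independent of $t$.

```latex
\begin{align*}
(t - r)\, u(z_t, r, t) &= \int_r^t v(z_\tau, \tau)\, d\tau
  && \text{(definition of } u \text{, multiplied by } t - r\text{)} \\
\frac{d}{dt}\Big[(t - r)\, u(z_t, r, t)\Big] &= v(z_t, t)
  && \text{(fundamental theorem of calculus)} \\
u(z_t, r, t) + (t - r)\, \frac{d}{dt} u(z_t, r, t) &= v(z_t, t)
  && \text{(product rule on the left-hand side)} \\
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)
  && \text{(rearranged)}
\end{align*}
```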

A neural approximation $u_\theta$ of $u$ can therefore be trained by minimizing the squared error between $u_\theta$ and the right-hand side of the identity, with $v(z_t, t)$ replaced by a tractable conditional velocity when the marginal field is not available in closed form (Geng et al., 19 May 2025).
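The identity can be checked numerically on a case where everything is known in closed form. The sketch below assumes a toy linear field $v(z, t) = A z$, for which the exact average velocity is $u(z_t, r, t) = z_t (1 - e^{-A(t-r)})/(t-r)$, and estimates the total derivative with a central difference along the direction $(v, 0, 1)$ in $(z, r, t)$. The field, the constant `A`, and the helper names are illustrative assumptions.

```python
import math

A = 1.5  # rate of the toy linear field (assumed for illustration)


def v(z, t):
    """Toy instantaneous velocity v(z, t) = A * z."""
    return A * z


def u(z, r, t):
    """Exact average velocity of the toy field over [r, t], evaluated at z = z_t."""
    dt = t - r
    if dt == 0.0:
        return v(z, t)
    return z * (-math.expm1(-A * dt)) / dt


def total_derivative_u(z, r, t, h=1e-5):
    """Central-difference estimate of d/dt u(z_t, r, t) = v * du/dz + du/dt,
    obtained by moving (z, t) by (h * v, h) while keeping r fixed."""
    vz = v(z, t)
    return (u(z + h * vz, r, t + h) - u(z - h * vz, r, t - h)) / (2.0 * h)


def identity_residual(z, r, t):
    """u - [v - (t - r) * d/dt u]; close to zero for the exact average velocity."""
    return u(z, r, t) - (v(z, t) - (t - r) * total_derivative_u(z, r, t))


if __name__ == "__main__":
    print(identity_residual(0.7, 0.2, 0.9))
```

The residual is zero only up to finite-difference and rounding error; in practice the directional derivative would come from automatic differentiation rather than a difference quotient.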

2. Training Objective, Loss Family, and Practical Algorithms

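The canonical MeanFlow loss is: $\mathcal{L}(\theta) = \mathbb{E}_{(x, e, r, t)} \| u_\theta(z_t, r, t) - \textrm{stopgrad}(v_t - (t-r)[v_t \cdot \nabla_z u_\theta + \partial_t u_\theta]) \|^2$, where $x$ is a data sample, $e \sim \mathcal{N}(0, I)$ is noise, $z_t = (1-t)x + t e$, and $v_t = e - x$ is the conditional velocity. The bracketed term is the total derivative of $u_\theta$ along the trajectory, computed as a Jacobian-vector product (JVP). The stop-gradient treats the target as a constant, so optimization does not backpropagate through the JVP, which avoids higher-order differentiation and stabilizes training (Geng et al., 19 May 2025).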
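As a rough illustration of how the regression target in this loss is assembled, the sketch below evaluates it for scalar samples. It is not the authors' implementation: `u_theta` is a hypothetical three-parameter toy model standing in for a neural network, the directional derivative along the tangent $(v_t, 0, 1)$ is approximated by a central difference instead of a JVP, and the way $(r, t)$ and the toy data are sampled is an assumption.

```python
import random


def u_theta(z, r, t, theta):
    """Hypothetical toy model of the average velocity (stand-in for a network)."""
    w0, w1, w2 = theta
    return w0 + w1 * z + w2 * (t - r)


def meanflow_target(x, e, r, t, theta, h=1e-4):
    """Return z_t and the target v_t - (t - r) * (v_t * du/dz + du/dt)."""
    z_t = (1.0 - t) * x + t * e  # interpolant between data x and noise e
    v_t = e - x                  # conditional velocity of the straight path
    # Directional derivative of u_theta along (v_t, 0, 1) in (z, r, t);
    # a central difference stands in for the JVP used in practice.
    du_dt = (u_theta(z_t + h * v_t, r, t + h, theta)
             - u_theta(z_t - h * v_t, r, t - h, theta)) / (2.0 * h)
    # The result is a plain float, hence constant with respect to theta,
    # which mirrors the role of stopgrad in the loss.
    return z_t, v_t - (t - r) * du_dt


def meanflow_loss(batch, theta):
    """Monte Carlo estimate of the squared-error loss over (x, e, r, t) tuples."""
    total = 0.0
    for x, e, r, t in batch:
        z_t, target = meanflow_target(x, e, r, t, theta)
        total += (u_theta(z_t, r, t, theta) - target) ** 2
    return total / len(batch)


def sample_batch(n, seed=0):
    """Toy samples: x from a shifted Gaussian, e ~ N(0, 1), and r <= t in [0, 1]."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        batch.append((rng.gauss(2.0, 0.5), rng.gauss(0.0, 1.0), min(a, b), max(a, b)))
    return batch


if __name__ == "__main__":
    print(meanflow_loss(sample_batch(64), (0.1, 0.2, 0.3)))
```

Two sanity checks follow directly from the formula: when $r = t$ the target reduces to $v_t$, the flow-matching target, and for a single pair $(x, e)$ a constant model equal to $e - x$ incurs zero loss.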

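Variants, such as those introduced in Modular MeanFlow (MMF), interpolate between first-order (consistency) losses and "full" MeanFlow via partial stop-gradient, with a curriculum schedule for the gradient coefficient $\lambda$: $\mathcal{L}_\lambda = \mathbb{E} \Big\| u_\theta(x_t, r, t) + (t - r) \, \textrm{SG}_\lambda\Big[\partial_t u_\theta + \nabla_x u_\theta \cdot \frac{x_1 - x_0}{t - r}\Big] - \frac{x_1 - x_0}{t - r} \Big\|^2$, where $\textrm{SG}_\lambda[z] = \lambda z + (1 - \lambda) [z]_{\textrm{stopgrad}}$ and $\lambda$ is annealed from 0 to 1 (You et al., 24 Aug 2025). This loss writes the state as $x_t$ with endpoints $x_0$ and $x_1$ rather than the $z_t$ notation used above. The forward value of $\textrm{SG}_\lambda[z]$ equals $z$ for every $\lambda$; only the fraction of the gradient that flows through the bracketed term changes, from none at $\lambda = 0$ to all of it at $\lambda = 1$.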
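The effect of the partial stop-gradient operator can be sketched without an autodiff library by tracking a value together with its derivative with respect to one parameter. The `Dual` class, the function names, and the linear annealing schedule below are illustrative assumptions; the source only states that $\lambda$ is annealed from 0 to 1, not which schedule is used.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to a single parameter."""
    value: float
    grad: float


def partial_stopgrad(z, lam):
    """SG_lambda[z] = lam * z + (1 - lam) * stopgrad(z).

    The forward value is z itself, because lam * z + (1 - lam) * z = z;
    only a fraction lam of the gradient is passed through.
    """
    return Dual(z.value, lam * z.grad)


def linear_lambda_schedule(step, total_steps):
    """Hypothetical linear curriculum that anneals lambda from 0 to 1."""
    return min(1.0, max(0.0, step / total_steps))


if __name__ == "__main__":
    z = Dual(3.0, 2.0)
    for step in (0, 50, 100):
        lam = linear_lambda_schedule(step, 100)
        print(step, lam, partial_stopgrad(z, lam))
```

At $\lambda = 0$ the bracketed term behaves like a fully detached target, and at $\lambda = 1$ gradients flow through it unchanged.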

Curriculum and adaptive loss weighting schemes are recurring themes, allowing a smooth transition from stable, first-order consistency enforcement to expressive, higher-order MeanFlow learning. $\alpha$-Flow (AlphaFlow) generalizes this further, unifying shortcut, flow-matching, and MeanFlow objectives under a single hyperparameter $\alpha$ that controls their relative weighting and mitigates gradient antagonism (Zhang et al., 23 Oct 2025).

For stability and robustness, an adaptive loss weight is often introduced: $w = (\|u_\theta - \textrm{tgt}\|^2 + c)^{-(1-\gamma)},\quad \gamma \in (0,1),\, c > 0$, where $\textrm{tgt}$ denotes the stop-gradient regression target. The weight shrinks as the per-sample error grows, which downweights outlier samples and improves convergence (Zhu et al., 27 Sep 2025).
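A minimal sketch of this weighting for scalar squared errors is shown below. The default values of `gamma` and `c` are illustrative assumptions, not values taken from the cited work, and the weight is treated as a constant (no gradient flows through it), as is usual for this kind of reweighting.

```python
def adaptive_weight(sq_error, gamma=0.5, c=1e-3):
    """w = (sq_error + c) ** (-(1 - gamma)), with 0 < gamma < 1 and c > 0."""
    if not 0.0 < gamma < 1.0 or c <= 0.0:
        raise ValueError("expected 0 < gamma < 1 and c > 0")
    return (sq_error + c) ** (-(1.0 - gamma))


def weighted_loss(sq_errors, gamma=0.5, c=1e-3):
    """Mean of w_i * err_i, each weight computed from its own squared error."""
    return sum(adaptive_weight(s, gamma, c) * s for s in sq_errors) / len(sq_errors)


if __name__ == "__main__":
    for s in (0.01, 1.0, 100.0):
        print(s, adaptive_weight(s), adaptive_weight(s) * s)
```

For errors much larger than `c`, the weighted per-sample loss grows like the squared error raised to the power $\gamma$ rather than linearly in it, so a single outlier contributes far less than it would under the unweighted loss.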

3. Inference Schemes, Efficiency, and One-Step Generation

At inference, MeanFlow eliminates the need for multi-step ODE solvers: $z_0 = z_1 - u_\theta(z_1, 0, 1)$, where $z_1$ is a sample from a simple prior (commonly $\mathcal{N}(0, I)$), and $u_\theta(z_1, 0, 1)$ estimates the average velocity to transport from noise to data in a single step (Geng et al., 19 May 2025). The same principle underlies recent approaches in speech enhancement (Zhu et al., 27 Sep 2025), image and audio synthesis (Yang et al., 8 Sep 2025), RL policy generation (Wang et al., 17 Nov 2025), and robotic manipulation (Zou et al., 9 Oct 2025).
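The one-step rule can be illustrated with a toy case in which the average velocity is known exactly. In the sketch below, `u_exact` is the closed-form average velocity of an assumed linear field $v(z, t) = A z$ and plays the role of a perfectly trained $u_\theta$; the field, the constant `A`, and the function names are assumptions for illustration only.

```python
import math
import random

A = 1.5  # rate of the toy linear field (assumed for illustration)


def u_exact(z, r, t):
    """Exact average velocity of v(z, t) = A * z over [r, t] (stand-in for u_theta)."""
    dt = t - r
    if dt == 0.0:
        return A * z
    return z * (-math.expm1(-A * dt)) / dt


def one_step_sample(z1, u=u_exact):
    """z_0 = z_1 - u(z_1, 0, 1): a single function evaluation, no ODE solver."""
    return z1 - u(z1, 0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    z1 = rng.gauss(0.0, 1.0)  # draw from the N(0, 1) prior
    print(z1, one_step_sample(z1), z1 * math.exp(-A))
```

Because the average velocity already encodes the net displacement over $[0, 1]$, the single update lands on the exact ODE endpoint $z_1 e^{-A}$ here; with a learned $u_\theta$ the result is only as accurate as the model.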

This paradigm is reported to yield real-time factors (RTF) of about $0.013$ for generative speech enhancement, versus $0.12$–$1.4$ for multi-step models; one-step FID of roughly $2.2$–$2.5$ on ImageNet-256 with XL/2-size DiT backbones; and action-generation latencies in robotic RL that are 20–40$\times$ lower than those of ODE- or diffusion-based policies (Zhu et al., 27 Sep 2025, Wang et al., 17 Nov 2025, Zou et al., 9 Oct 2025).

MeanFlow also subsumes few-step numerical updates: with more steps and adaptive intervals, accuracy further approaches multi-step ODE solvers, but empirical studies consistently show that MeanFlow's one- and two-step performance narrows or closes the sample quality gap to multi-step baselines at orders-of-magnitude lower computational cost (Geng et al., 19 May 2025, Lee et al., 28 Oct 2025).
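A few-step sampler applies the same displacement rule, $z_r = z_t - (t - r)\, u(z_t, r, t)$, over a decreasing time grid. The sketch below again uses the closed-form average velocity of an assumed linear field $v(z, t) = A z$ as a stand-in for a trained model; the grid values and function names are illustrative assumptions.

```python
import math

A = 1.5  # rate of the toy linear field (assumed for illustration)


def u_exact(z, r, t):
    """Exact average velocity of v(z, t) = A * z over [r, t] (stand-in for u_theta)."""
    dt = t - r
    if dt == 0.0:
        return A * z
    return z * (-math.expm1(-A * dt)) / dt


def few_step_sample(z1, times, u=u_exact):
    """Apply z_r = z_t - (t - r) * u(z_t, r, t) along a decreasing grid from 1 to 0."""
    z = z1
    for t, r in zip(times[:-1], times[1:]):
        z = z - (t - r) * u(z, r, t)  # one model evaluation per interval
    return z


if __name__ == "__main__":
    z1 = 2.0
    print(few_step_sample(z1, [1.0, 0.0]))            # one step
    print(few_step_sample(z1, [1.0, 0.5, 0.0]))       # two steps
    print(few_step_sample(z1, [1.0, 0.7, 0.3, 0.0]))  # three steps, uneven grid
```

With the exact average velocity every grid returns the same endpoint, so extra steps only matter to the extent that a learned $u_\theta$ is more accurate on shorter intervals.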

4. Extensions, Architectural Adaptations, and Application Domains

MeanFlow's formalism has been extended well beyond its original finite-dimensional image setting, including to infinite-dimensional function spaces:

Functional Mean Flow (FMF): Extension to Hilbert spaces for functional data, enabling one-step generation of time series, PDE solutions, 3D geometry, and images at arbitrary resolution (Li et al., 17 Nov 2025).
Conditional and Multimodal Models: For speech and audio, MF conditions on SSL representations rather than spectrograms or VAE latents, improving perceptual quality and intelligibility (Zhu et al., 27 Sep 2025, Li et al., 18 Sep 2025, Yang et al., 8 Sep 2025).
Residual and Decoupled Architectures: RL and robotics models employ residual parameterizations and dispersive regularization to stabilize training and prevent representational collapse (Wang et al., 17 Nov 2025, Zou et al., 9 Oct 2025). Decoupled MeanFlow divides time conditioning between encoder and decoder blocks, enabling training from pretrained flow models and achieving strong 1- and 4-step FID on ImageNet (Lee et al., 28 Oct 2025).
Representation Autoencoders: When combined with pre-trained, semantically rich representation autoencoders (RAEs), MeanFlow achieves high sample quality with reduced GFLOPs and stable training even in high-dimensional latent spaces (Hu et al., 17 Nov 2025).
Discriminative Compression (MFI-ResNet): MeanFlow compression of ResNet stages enables over 45% parameter reduction with improved accuracy, linking generative flow matching to standard discriminative architectures (Sun et al., 16 Nov 2025).

The table below summarizes representative domains and adaptations:

Domain	Key MF Mechanism	References
Image generation	DiT backbone, VAE/RAE latents	(Geng et al., 19 May 2025, Hu et al., 17 Nov 2025)
Audio/Speech	SSL conditioning, conditional MF	(Zhu et al., 27 Sep 2025, Yang et al., 8 Sep 2025)
RL and Robotics	Residual mapping, dispersive regularization	(Wang et al., 17 Nov 2025, Zou et al., 9 Oct 2025)
Functional Data	Hilbert-space MF, FNO architectures	(Li et al., 17 Nov 2025)
CV Classifiers	MF stage compression, incubated ResNets	(Sun et al., 16 Nov 2025)

5. Theoretical Properties, Guarantees, and Analysis

The MeanFlow identity guarantees that the average velocity field contains all information necessary to reconstruct the instantaneous velocity, and that as $(t - r) \to 0$, $u(z_t, r, t) \to v(z_t, t)$, making flow matching a special case of MF (Geng et al., 19 May 2025, You et al., 24 Aug 2025). Path-consistency identities ensure that the average velocities over adjacent intervals compose correctly: for $r \le s \le t$ and $z_s$ on the same trajectory as $z_t$, $(t - r)u(z_t, r, t) = (s - r) u(z_s, r, s) + (t - s) u(z_t, s, t)$ (You et al., 24 Aug 2025). Theoretical extensions in Hilbert space establish existence and differentiability under mild assumptions, as well as equivalence of the conditional and marginal losses up to additive constants (Li et al., 17 Nov 2025).
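Both properties, the composition rule over adjacent intervals and the limit $u \to v$ as the interval shrinks, can be checked numerically on a toy example. The sketch below assumes a linear field $v(z, t) = A z$ with trajectory $z_s = z_t e^{A(s - t)}$; the field and all helper names are illustrative assumptions.

```python
import math

A = 1.5  # rate of the toy linear field (assumed for illustration)


def flow(z_t, t, s):
    """Move a state along the toy trajectory from time t to time s."""
    return z_t * math.exp(A * (s - t))


def u(z, r, t):
    """Exact average velocity of v(z, t) = A * z over [r, t], evaluated at z = z_t."""
    dt = t - r
    if dt == 0.0:
        return A * z
    return z * (-math.expm1(-A * dt)) / dt


def composition_gap(z_t, r, s, t):
    """(t - r) u(z_t, r, t) - [(s - r) u(z_s, r, s) + (t - s) u(z_t, s, t)]."""
    z_s = flow(z_t, t, s)  # state on the same trajectory at the intermediate time
    left = (t - r) * u(z_t, r, t)
    right = (s - r) * u(z_s, r, s) + (t - s) * u(z_t, s, t)
    return left - right


if __name__ == "__main__":
    print(composition_gap(0.7, 0.1, 0.4, 0.9))  # close to zero
    for eps in (1e-1, 1e-3, 1e-6):
        print(eps, u(0.7, 0.5 - eps, 0.5), A * 0.7)  # approaches v(z, t)
```

Each term $(b - a)\,u(\cdot, a, b)$ is the displacement over $[a, b]$, so the composition rule simply states that displacements over adjacent intervals add up.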

In-depth analysis uncovers antagonistic gradients between the trajectory flow matching and consistency terms in the MF loss, motivating curriculum strategies (scheduling of $\alpha$ in $\alpha$-Flow) to avoid optimization conflict and accelerate convergence (Zhang et al., 23 Oct 2025). Disentangling the two terms yields improved FID, lower Fréchet DINOv2 Distance (FDD), and faster training across a range of vanilla DiT backbones.

6. Empirical Results and Comparative Evaluation

MeanFlow variants achieve state-of-the-art or highly competitive performance across modalities:

Image Synthesis (ImageNet 256): MF-XL/2 attains 1-NFE FID of 3.43 and 2-NFE FID of 2.93. $\alpha$-Flow and Decoupled MF models improve 1-step FID to 2.16–2.58, surpassing prior few-step approaches (Geng et al., 19 May 2025, Lee et al., 28 Oct 2025, Zhang et al., 23 Oct 2025).
Robotic Manipulation: DM1 achieves 97–99% success on Lift (vs. 84–85% for baselines) and lowers inference latency by an order of magnitude (Zou et al., 9 Oct 2025).
Speech Enhancement: MeanFlowSE yields DNSMOS OVRL $>3.36$, WER 8.5% (vs. 12.4%+ for competitors), and RTF $<0.015$, enabling real-time, high-fidelity enhancement (Zhu et al., 27 Sep 2025).
Functional Generation (FMF): FMF matches multi-step functional diffusion baselines on time series, PDEs, and 3D shape reconstruction in one step (Li et al., 17 Nov 2025).

Across these domains, the reported ablations indicate that MF's one-step paradigm offers a favorable efficiency/quality trade-off, and that fine-grained regularization strategies further improve stability and generalization.

7. Limitations, Open Questions, and Future Research

Despite its generality and empirical success, current MeanFlow work has several open limitations:

Most large-scale successes use latent-space (VAE/RAE) domains rather than direct pixel or waveform space; extending MeanFlow to full-resolution generation directly in pixel or waveform space remains an active research direction.
Stability depends on careful loss weighting, stop-gradient insertion, and sometimes multi-stage or curriculum training, particularly in high-dimensional or ill-conditioned settings (Hu et al., 17 Nov 2025, You et al., 24 Aug 2025).
Streaming, online, or low-latency extensions for speech and spatial/multimodal data are under-explored (Zhu et al., 27 Sep 2025).
Theoretical quantification of the approximation error between conditional and marginal velocity targets remains incomplete (Geng et al., 19 May 2025).
Adaptive scaling of loss components, joint training with larger representation models, and cross-modal extensions represent current research frontiers (Li et al., 17 Nov 2025, Wang et al., 17 Nov 2025).

MeanFlow has established a general, mathematically principled foundation for efficient, high-fidelity, one-step generative modeling—spanning images, audio, functional data, policies, and classifier architectures—while motivating further innovation in loss design, representation learning, and stability enhancement (Geng et al., 19 May 2025, Lee et al., 28 Oct 2025, You et al., 24 Aug 2025, Zhang et al., 23 Oct 2025, Li et al., 17 Nov 2025, Hu et al., 17 Nov 2025, Zou et al., 9 Oct 2025).