
MeanFlow Pretraining: Efficient One-Step Generation

Updated 19 January 2026
  • MeanFlow Pretraining is a generative modeling framework that directly regresses the time-averaged velocity field to achieve high-fidelity one-step sample generation while drastically reducing computational cost.
  • It leverages a differential identity linking instantaneous and average velocities, enabling stable and efficient training through gradient modulation and curriculum warmup strategies.
  • Recent extensions like Decoupled, Rectified, and Latent-Space MeanFlow have broadened its applications to image synthesis, trajectory modeling, and reinforcement learning.

MeanFlow Pretraining is a methodology for training one-step or few-step generative models via direct regression to the time-averaged velocity field between noise and data. The approach enables high-fidelity sampling in a single function evaluation (1 NFE), dramatically reducing computational cost compared to classical diffusion or flow models that require hundreds of integration steps. Central to MeanFlow is a differential identity linking the instantaneous velocity (as in flow-matching) and the average velocity over a finite interval, which serves as the core target for training. Recent advances—Modular MeanFlow, Decoupled MeanFlow, Rectified MeanFlow, and MeanFlow pretraining in latent or reinforcement learning settings—unify, generalize, and optimize this framework for image synthesis, trajectory modeling, and policy generation, achieving state-of-the-art quality and efficiency.

1. Mathematical Foundation: MeanFlow Identity and Average Velocity

Let $x_0 \sim p_{\mathrm{data}}$ (data), $x_1 \sim p_{\mathrm{prior}}$ (noise), and define interpolated states $x_t = (1-\alpha)x_0 + \alpha x_1$ for $t \in [0,1]$, with $\alpha = (t-r)/(1-r)$ and $0 \leq r < t \leq 1$ (You et al., 24 Aug 2025). The instantaneous velocity field is $v(x_t, t)$, while the MeanFlow average velocity over $[r,t]$ is

$$u(x_t,r,t) = \frac{1}{t-r} \int_{r}^{t} v(x_\tau,\tau)\, d\tau.$$

Crucially, the MeanFlow identity relates the two:

$$v(x_t,t) = u(x_t,r,t) + (t-r)\,\frac{d}{dt}u(x_t,r,t),$$

where $\frac{d}{dt}u = \partial_t u + (\nabla_x u)\cdot v(x_t,t)$. For practical training, $v$ in the Jacobian term may be replaced by $u$ to yield the regression target:

$$u + (t-r)\big(\partial_t u + \nabla_x u \cdot u\big) \approx \frac{x_1-x_0}{t-r}.$$

This identity underpins all loss formulations for MeanFlow pretraining.
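The identity can be sanity-checked numerically on a toy velocity field; the scalar field below, which depends only on time, is a hypothetical example chosen so that the average velocity has a closed form, not a trained model:

```python
import numpy as np

# Toy numerical check of the MeanFlow identity
#   v(x_t, t) = u(x_t, r, t) + (t - r) * d/dt u(x_t, r, t)
# for a velocity field depending only on time: v(t) = a * t.
# The average velocity then has the closed form u(r, t) = a*(t + r)/2,
# and the x-gradient term in d/dt u vanishes.
a = 3.0
v = lambda t: a * t                  # instantaneous velocity
u = lambda r, t: a * (t + r) / 2.0   # average velocity over [r, t]

r, t = 0.2, 0.9
eps = 1e-5
# d/dt u via central finite differences
du_dt = (u(r, t + eps) - u(r, t - eps)) / (2 * eps)

lhs = v(t)
rhs = u(r, t) + (t - r) * du_dt
assert abs(lhs - rhs) < 1e-6  # identity holds up to finite-difference error
```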

2. Loss Functions, Gradient Modulation, and Training Schedules

Several training losses have been introduced for stable MeanFlow pretraining:

  • Full second-order MeanFlow loss (not used in practice due to higher-order gradient cost):

$$\mathcal{L}_{\text{full}}(\theta) = \mathbb{E}_{x_0, x_1, r < t} \left\| u_\theta(x_t,r,t) + (t-r)\left( \partial_t u_\theta + \nabla_x u_\theta \cdot u_\theta \right) - \frac{x_1-x_0}{t-r} \right\|^2$$

(You et al., 24 Aug 2025).

  • Gradient-modulated MeanFlow loss: introduces a partial stop-gradient operator $\mathrm{SG}_\lambda[z] = \lambda z + (1-\lambda)\,\mathrm{stopgrad}(z)$, where $\lambda \in [0,1]$ controls gradient flow. The training objective becomes:

$$\mathcal{L}_\lambda(\theta) = \mathbb{E}_{x_0, x_1, r < t} \left\| u_\theta(x_t, r, t) + (t - r)\,\mathrm{SG}_\lambda\!\left[ \partial_t u_\theta(x_t,r,t) + \nabla_x u_\theta(x_t,r,t)\cdot \left(\frac{x_1-x_0}{t-r}\right)\right] - \frac{x_1-x_0}{t-r} \right\|^2.$$

  • Curriculum-style warmup: $\lambda$ is scheduled from $0$ (pure stop-gradient, "coarse" estimation) to $1$ (full backprop, "maximum expressivity") over an initial warmup phase, then fixed at $1$. This procedure allows the network to first stabilize on simple targets before learning detailed dynamics, which is essential to avoid gradient explosion (You et al., 24 Aug 2025). A linear schedule is given by $\lambda(s) = \min(1, s / T_{\text{warmup}})$ over training step $s$.

Empirically, improper settings of $\lambda$ (e.g., jumping to $1$ too early or staying too low) lead to instability or underfitting. Monitoring gradient norms alongside the schedule is crucial (You et al., 24 Aug 2025).
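A minimal sketch of the warmup schedule (the stop-gradient operator itself requires an autodiff framework; the PyTorch-style line in the comment is an assumption about how $\mathrm{SG}_\lambda$ would typically be written, not code from the papers):

```python
def lam_schedule(step: int, warmup_steps: int) -> float:
    """Linear curriculum: lambda ramps from 0 (pure stop-gradient) to 1
    (full backprop) over the warmup phase, then stays fixed at 1."""
    return min(1.0, step / warmup_steps)

# SG_lambda[z] = lam * z + (1 - lam) * stopgrad(z): identical in value to z,
# but only a fraction lam of the gradient flows through the bracketed term.
# In PyTorch-style pseudocode (hypothetical, framework-dependent):
#   sg_z = lam * z + (1 - lam) * z.detach()

assert lam_schedule(0, 1000) == 0.0     # warmup start: pure stop-gradient
assert lam_schedule(500, 1000) == 0.5   # halfway through warmup
assert lam_schedule(2500, 1000) == 1.0  # after warmup: full backprop
```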

3. Pretraining Pipelines: Variants and Structural Optimizations

Recent work has generalized and structurally optimized the MeanFlow paradigm:

  • Joint MeanFlow Training (Kim et al., 24 Nov 2025): jointly trains both instantaneous ($v_\theta$) and average ($u_\theta$) velocity heads:

$$\mathcal{L}_v = \mathbb{E}\left[ \|v_\theta(z_t, t) - (\epsilon - x)\|_2^2 \right], \quad \mathcal{L}_u = \mathbb{E}\left[ \| u_\theta(z_t, r, t) - \mathrm{sg}(u_{\text{tgt}}(z_t; r, t))\|_2^2 \right]$$

with $u_{\text{tgt}}$ a differentiable target built from the learnable $v_\theta$. Crucially, an accurate $v_\theta$ is a hard prerequisite for good $u_\theta$ learning, necessitating a curriculum that accelerates $v_\theta$ training and schedules the gap sizes ($\Delta t = t-r$) seen by $u_\theta$ (Kim et al., 24 Nov 2025).

  • Decoupled MeanFlow (DMF) (Lee et al., 28 Oct 2025): reinterprets pretrained DiT (diffusion transformer) flow models as flow maps (average-velocity predictors) without architectural change. The DiT backbone is split into an encoder (conditioned on the $t$ embedding) and a decoder (conditioned on the $r$ embedding); only the second-stage decoder is retrained for arbitrary interval jumps. Training combines flow-matching and MeanFlow losses with adaptive Cauchy loss reweighting.
  • Rectified MeanFlow (Re-MeanFlow) (Zhang et al., 28 Nov 2025): Addresses MeanFlow’s difficulty on curved flows by reparameterizing couplings using a single “reflow” step—first training a flow model, generating rectified couplings via ODE solving, truncating highly curved pairs, then training the MeanFlow predictor. This preconditioning improves convergence and sample quality.
  • Latent-Space and RAE-based MeanFlow (Hu et al., 17 Nov 2025): MeanFlow is combined with frozen, semantically rich representation autoencoders (e.g., DINO-based). Naive MF training in latent space causes gradient explosion; to counteract this, Consistency Mid-Training initializes the MeanFlow predictor along ODE trajectories from a pre-trained teacher. Two-stage training (distillation, then bootstrapping) provides stable, efficient 1-step generation.
  • Reinforcement Learning Policies (Wang et al., 17 Nov 2025): the original two-stage MeanFlow Q-learning involves velocity pretraining followed by distillation. A residual reformulation unifies this into a single policy network $g_\theta(s,a_t,b,t) = a_t - u_\theta(s,a_t,b,t)$, avoiding the expressivity bottleneck of distillation and supporting stable policy learning in a single stage.
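As an illustration of the residual reformulation, the sketch below uses a stand-in linear map for $u_\theta$ and drops the conditioning arguments $(b, t)$ for brevity; none of this reflects the actual trained network:

```python
import numpy as np

# Residual policy sketch: g(s, a_t) = a_t - u(s, a_t).
# u_theta here is a hypothetical fixed linear map, NOT a trained network;
# the conditioning arguments (b, t) from the paper are omitted for brevity.
def u_theta(s, a_t):
    return 0.5 * a_t + 0.1 * s

def residual_policy(s, a_t):
    # One-step action via a single evaluation of the (stand-in) network
    return a_t - u_theta(s, a_t)

s = np.array([1.0, -1.0])
a_t = np.array([0.4, 0.2])
a = residual_policy(s, a_t)
assert np.allclose(a, [0.1, 0.2])  # a_t - (0.5*a_t + 0.1*s)
```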

4. Empirical Results and Training Dynamics

MeanFlow pretraining delivers high one-step sample quality (FID), fast convergence, and robust performance:

| Model / Setting | Dataset / Size | 1-Step FID ↓ | 2-Step FID ↓ | Notes |
|---|---|---|---|---|
| Modular MeanFlow (MMF) | ImageNet 256×256 | 3.43 | 2.93 | DiT-XL/2 backbone (You et al., 24 Aug 2025) |
| Accelerated MF (DTD) | ImageNet 256×256 | 2.87 | 2.64 | Task affinity + curriculum (Kim et al., 24 Nov 2025) |
| Decoupled MF (DMF-XL/2+) | ImageNet 256×256 | 2.16 | 1.64 | No arch change, Cauchy loss (Lee et al., 28 Oct 2025) |
| Rectified MF (Re-MeanFlow) | ImageNet 256×256 | 3.41 | – | Truncated, 1 reflow (Zhang et al., 28 Nov 2025) |
| MF+RAE (distill) | ImageNet 256×256 | 2.03 | 1.89 | RAE latent, no guidance (Hu et al., 17 Nov 2025) |

Re-MeanFlow on ImageNet 64 shows FID = 2.87 (EDM2-S), matching or outperforming previous few-step flow-based methods. In the RL context, residual MeanFlow policies outperform prior two-stage flow-policy baselines on 65/73 OGBench and D4RL tasks, maintaining near-zero “bound loss” and expressivity for multimodal action distributions (Wang et al., 17 Nov 2025).

Empirical dynamics highlight that accurate instantaneous velocity learning must precede average velocity, and small-gap average velocity supervision stabilizes the progression towards large-jump one-step models (Kim et al., 24 Nov 2025). Failure to respect these curriculum constraints destabilizes or slows convergence.

5. Implementation and Practical Strategies

Key hyperparameters and strategies common to successful MeanFlow pretraining:

  • Batch sizes: typically $128$–$512$ (larger batches recommended during initial warmup) (You et al., 24 Aug 2025, Hu et al., 17 Nov 2025).
  • Learning rates: $10^{-4}$ with cosine decay; Adam or AdamW optimizer (You et al., 24 Aug 2025, Lee et al., 28 Oct 2025, Hu et al., 17 Nov 2025).
  • Exponential moving average (EMA) parameter stabilization is consistently beneficial (e.g., $\beta=0.9999$) (Zhang et al., 28 Nov 2025).
  • Curriculum over interval gap: linearly or progressively increase the gap size seen by the $u_\theta$ head (e.g., $\beta(\Delta t, s)=1-s+\lambda s(1-\Delta t)$) (Kim et al., 24 Nov 2025).
  • For efficiency, Jacobian-vector-products (JVP) can be computed via forward-mode autodiff; finite-difference approximations can stabilize weakly trained teachers in latent-space MeanFlow (Hu et al., 17 Nov 2025).
  • Distance truncation in Re-MeanFlow avoids instability from highly curved rectified paths; 10% discarding is usually effective (Zhang et al., 28 Nov 2025).
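Two of these strategies admit short sketches; the toy $u$ and the step sizes below are illustrative assumptions, not values from the papers:

```python
import numpy as np

# (1) EMA parameter stabilization: ema <- beta*ema + (1-beta)*params.
def ema_update(ema, params, beta=0.9999):
    return {k: beta * ema[k] + (1.0 - beta) * params[k] for k in params}

# (2) Finite-difference approximation of the total derivative
#     d/dt u = dt_u + grad_x u . v,
# i.e. the JVP of u along the direction (v, 1):
#     du/dt ~= (u(x + eps*v, r, t + eps) - u(x, r, t)) / eps.
def fd_total_derivative(u, x, r, t, v, eps=1e-4):
    return (u(x + eps * v, r, t + eps) - u(x, r, t)) / eps

# Toy check with u(x, r, t) = t * x, where exactly d/dt u = x + t * v.
u = lambda x, r, t: t * x
x = np.array([1.0, 2.0])
v = np.ones_like(x)
approx = fd_total_derivative(u, x, 0.0, 0.5, v)
exact = x + 0.5 * v
assert np.allclose(approx, exact, atol=1e-3)
```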

In Decoupled MeanFlow, architectural reuse (encoder/decoder split) and combined FM/MF losses enable direct upgrading of pretrained diffusion models without additional weights or layers (Lee et al., 28 Oct 2025).

6. Applications and Extensions: Image Synthesis, Latent Models, and Policy Learning

MeanFlow pretraining is widely adopted in:

  • Image synthesis: Modular MeanFlow, DMF, Re-MeanFlow, and latent-space MF approaches all achieve state-of-the-art FID in one or few steps on ImageNet and other datasets (You et al., 24 Aug 2025, Lee et al., 28 Oct 2025, Zhang et al., 28 Nov 2025, Hu et al., 17 Nov 2025).
  • Latent generative models: MeanFlow leverages powerful representation autoencoders for semantically meaningful generation, significantly reducing computational requirements compared to SD-VAE pipelines (Hu et al., 17 Nov 2025).
  • Offline RL policy learning: Residual MeanFlow reformulation (one-step Q-learning) enables fast, expressive, stable policy deployment in both tabular and continuous-control settings, outperforming both Gaussian and compositionally distilled flow policies (Wang et al., 17 Nov 2025).

Curriculum-based, modular, and decoupled MeanFlow strategies continue to generalize as foundational techniques for scalable, efficient, and robust generative modeling.

7. Limitations, Pitfalls, and Best Practices

MeanFlow training can exhibit instability due to higher-order gradients unless proper gradient modulation, curriculum scheduling, and architectural choices are applied (You et al., 24 Aug 2025, Kim et al., 24 Nov 2025). Key pitfalls:

  • Jumping directly to full gradient backpropagation ($\lambda=1$) can cause oscillatory, divergent losses.
  • Keeping $\lambda$ small throughout restricts expressivity, leading to poor FID.
  • Instability arises with very small interval gaps $(t-r)$; this can be mitigated by thresholding $t-r$ and appropriately reweighting loss terms (You et al., 24 Aug 2025).
  • In latent-space MF, naive initialization without trajectory-aware warm-start (e.g., CMT) leads to severe gradient explosion (Hu et al., 17 Nov 2025).

Best practices include:

  • Monitoring both the regression loss and gradient norms throughout curriculum ramp-up.
  • Applying distance truncation and adaptive loss scaling as needed.
  • Optionally fine-tuning on long intervals ($r=0$, $t=1$) late in training to polish one-step accuracy.
  • Using adaptive schedules for curriculum and guidance mixing.
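The gradient-norm monitoring in the first practice can be a one-liner; a minimal NumPy sketch:

```python
import numpy as np

# Global L2 gradient norm over a list of per-parameter gradient arrays.
# A spike in this quantity during the lambda ramp-up is an early warning
# of the instabilities described above.
def global_grad_norm(grads):
    return float(np.sqrt(sum(float(np.sum(g * g)) for g in grads)))

grads = [np.array([3.0]), np.array([4.0])]
assert global_grad_norm(grads) == 5.0  # sqrt(9 + 16)
```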

MeanFlow pretraining provides a mathematically principled, empirically robust approach for one-step generative modeling, acceleration of flow-based architectures, and efficient reinforcement learning policy generation—with modular, scalable strategies now established across the generative modeling landscape (You et al., 24 Aug 2025, Kim et al., 24 Nov 2025, Lee et al., 28 Oct 2025, Zhang et al., 28 Nov 2025, Hu et al., 17 Nov 2025, Wang et al., 17 Nov 2025).
