
MeanFlow Pretraining: Efficient One-Step Generation

Updated 19 January 2026
  • MeanFlow Pretraining is a generative modeling framework that directly regresses the time-averaged velocity field to achieve high-fidelity one-step sample generation while drastically reducing computational cost.
  • It leverages a differential identity linking instantaneous and average velocities, enabling stable and efficient training through gradient modulation and curriculum warmup strategies.
  • Recent extensions like Decoupled, Rectified, and Latent-Space MeanFlow have broadened its applications to image synthesis, trajectory modeling, and reinforcement learning.

MeanFlow Pretraining is a methodology for training one-step or few-step generative models via direct regression to the time-averaged velocity field between noise and data. The approach enables high-fidelity sampling in a single function evaluation (1 NFE), dramatically reducing computational cost compared to classical diffusion or flow models that require hundreds of integration steps. Central to MeanFlow is a differential identity linking the instantaneous velocity (as in flow-matching) and the average velocity over a finite interval, which serves as the core target for training. Recent advances—Modular MeanFlow, Decoupled MeanFlow, Rectified MeanFlow, and MeanFlow pretraining in latent or reinforcement learning settings—unify, generalize, and optimize this framework for image synthesis, trajectory modeling, and policy generation, achieving state-of-the-art quality and efficiency.

1. Mathematical Foundation: MeanFlow Identity and Average Velocity

Let $x_0 \sim p_{\mathrm{data}}$ (data), $x_1 \sim p_{\mathrm{prior}}$ (noise), and define interpolated states $x_t = (1-\alpha)x_0 + \alpha x_1$ for $t \in [0,1]$, with $\alpha = (t-r)/(1-r)$ and $0 \leq r < t \leq 1$ (You et al., 24 Aug 2025). The instantaneous velocity field is $v(x_t, t)$, while the MeanFlow average velocity over $[r,t]$ is

$$u(x_t,r,t) = \frac{1}{t-r} \int_{r}^{t} v(x_\tau,\tau)\, d\tau.$$

Crucially, the MeanFlow identity relates the two:

$$v(x_t,t) = u(x_t,r,t) + (t-r)\,\frac{d}{dt}u(x_t,r,t),$$

where $\frac{d}{dt}u = \partial_t u + (\nabla_x u)\cdot v(x_t,t)$. For practical training, $v$ in the Jacobian term may be replaced by $u$ to yield the regression target:

$$u + (t-r)\big(\partial_t u + \nabla_x u \cdot u\big) \approx \frac{x_1-x_0}{t-r}.$$

This identity underpins all loss formulations for MeanFlow pretraining.
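The identity can be sanity-checked numerically on a toy velocity field; the scalar field below, which depends only on time, is a hypothetical example chosen so that the average velocity has a closed form, not a trained model:

```python
import numpy as np

# Toy numerical check of the MeanFlow identity
#   v(x_t, t) = u(x_t, r, t) + (t - r) * d/dt u(x_t, r, t)
# for a velocity field depending only on time: v(t) = a * t.
# The average velocity then has the closed form u(r, t) = a*(t + r)/2,
# and the x-gradient term in d/dt u vanishes.
a = 3.0
v = lambda t: a * t                  # instantaneous velocity
u = lambda r, t: a * (t + r) / 2.0   # average velocity over [r, t]

r, t = 0.2, 0.9
eps = 1e-5
# d/dt u via central finite differences
du_dt = (u(r, t + eps) - u(r, t - eps)) / (2 * eps)

lhs = v(t)
rhs = u(r, t) + (t - r) * du_dt
assert abs(lhs - rhs) < 1e-6  # identity holds up to finite-difference error
```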

2. Loss Functions, Gradient Modulation, and Training Schedules

Several training losses have been introduced for stable MeanFlow pretraining:

  • Full second-order MeanFlow loss (not used in practice due to higher-order gradient cost):

$$\mathcal{L}_{\text{full}}(\theta) = \mathbb{E}_{x_0, x_1, r < t} \left\| u_\theta(x_t,r,t) + (t-r)\left( \partial_t u_\theta + \nabla_x u_\theta \cdot u_\theta \right) - \frac{x_1-x_0}{t-r} \right\|^2$$

(You et al., 24 Aug 2025).

  • Gradient-modulated MeanFlow loss: introduces a partial stop-gradient operator $\mathrm{SG}_\lambda[z] = \lambda z + (1-\lambda)\,\mathrm{stopgrad}(z)$, where $\lambda \in [0,1]$ controls gradient flow. The training objective becomes:

$$\mathcal{L}_\lambda(\theta) = \mathbb{E}_{x_0, x_1, r < t} \left\| u_\theta(x_t, r, t) + (t - r)\,\mathrm{SG}_\lambda\!\left[ \partial_t u_\theta(x_t,r,t) + \nabla_x u_\theta(x_t,r,t)\cdot \left(\frac{x_1-x_0}{t-r}\right)\right] - \frac{x_1-x_0}{t-r} \right\|^2.$$

  • Curriculum-style warmup: $\lambda$ is scheduled from $0$ (pure stop-gradient, "coarse" estimation) to $1$ (full backprop, "maximum expressivity") over an initial warmup phase, then fixed at $1$. This procedure allows the network to first stabilize on simple targets before learning detailed dynamics, which is essential to avoid gradient explosion (You et al., 24 Aug 2025). A linear schedule is given by $\lambda(s) = \min(1, s / T_{\text{warmup}})$ over training step $s$.

Empirically, improper settings of $\lambda$ (e.g., jumping to $1$ too early or staying too low) lead to instability or underfitting. Monitoring gradient norms alongside the schedule is crucial (You et al., 24 Aug 2025).
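A minimal sketch of the warmup schedule (the stop-gradient operator itself requires an autodiff framework; the PyTorch-style line in the comment is an assumption about how $\mathrm{SG}_\lambda$ would typically be written, not code from the papers):

```python
def lam_schedule(step: int, warmup_steps: int) -> float:
    """Linear curriculum: lambda ramps from 0 (pure stop-gradient) to 1
    (full backprop) over the warmup phase, then stays fixed at 1."""
    return min(1.0, step / warmup_steps)

# SG_lambda[z] = lam * z + (1 - lam) * stopgrad(z): identical in value to z,
# but only a fraction lam of the gradient flows through the bracketed term.
# In PyTorch-style pseudocode (hypothetical, framework-dependent):
#   sg_z = lam * z + (1 - lam) * z.detach()

assert lam_schedule(0, 1000) == 0.0     # warmup start: pure stop-gradient
assert lam_schedule(500, 1000) == 0.5   # halfway through warmup
assert lam_schedule(2500, 1000) == 1.0  # after warmup: full backprop
```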

3. Pretraining Pipelines: Variants and Structural Optimizations

Recent work has generalized and structurally optimized the MeanFlow paradigm:

  • Joint MeanFlow Training (Kim et al., 24 Nov 2025): jointly trains both instantaneous ($v_\theta$) and average ($u_\theta$) velocity heads:

$$\mathcal{L}_v = \mathbb{E}\left[ \|v_\theta(z_t, t) - (\epsilon - x)\|_2^2 \right], \quad \mathcal{L}_u = \mathbb{E}\left[ \| u_\theta(z_t, r, t) - \mathrm{sg}(u_{\text{tgt}}(z_t; r, t))\|_2^2 \right]$$

with $u_{\text{tgt}}$ a differentiable target built from the learnable $v_\theta$. Crucially, an accurate $v_\theta$ is a hard prerequisite for good $u_\theta$ learning, necessitating a curriculum that accelerates $v_\theta$ training and schedules the gap sizes ($\Delta t = t-r$) seen by $u_\theta$ (Kim et al., 24 Nov 2025).

  • Decoupled MeanFlow (DMF) (Lee et al., 28 Oct 2025): reinterprets pretrained DiT (diffusion transformer) flow models as flow maps (average-velocity predictors) without architectural change. The DiT backbone is split into an encoder (conditioned on the $t$ embedding) and a decoder (conditioned on the $r$ embedding); only the second-stage decoder is retrained for arbitrary interval jumps. Training combines flow-matching and MeanFlow losses with adaptive Cauchy loss reweighting.
  • Rectified MeanFlow (Re-MeanFlow) (Zhang et al., 28 Nov 2025): Addresses MeanFlow’s difficulty on curved flows by reparameterizing couplings using a single “reflow” step—first training a flow model, generating rectified couplings via ODE solving, truncating highly curved pairs, then training the MeanFlow predictor. This preconditioning improves convergence and sample quality.
  • Latent-Space and RAE-based MeanFlow (Hu et al., 17 Nov 2025): MeanFlow is combined with frozen, semantically rich representation autoencoders (e.g., DINO-based). Naive MF training in latent space causes gradient explosion; to counteract this, Consistency Mid-Training initializes the MeanFlow predictor along ODE trajectories from a pre-trained teacher. Two-stage training (distillation, then bootstrapping) provides stable, efficient 1-step generation.
  • Reinforcement Learning Policies (Wang et al., 17 Nov 2025): the original two-stage MeanFlow Q-learning involves velocity pretraining followed by distillation. A residual reformulation unifies this into a single policy network $g_\theta(s,a_t,b,t) = a_t - u_\theta(s,a_t,b,t)$, avoiding the expressivity bottleneck of distillation and supporting stable policy learning in a single stage.
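As an illustration of the residual reformulation, the sketch below uses a stand-in linear map for $u_\theta$ and drops the conditioning arguments $(b, t)$ for brevity; none of this reflects the actual trained network:

```python
import numpy as np

# Residual policy sketch: g(s, a_t) = a_t - u(s, a_t).
# u_theta here is a hypothetical fixed linear map, NOT a trained network;
# the conditioning arguments (b, t) from the paper are omitted for brevity.
def u_theta(s, a_t):
    return 0.5 * a_t + 0.1 * s

def residual_policy(s, a_t):
    # One-step action via a single evaluation of the (stand-in) network
    return a_t - u_theta(s, a_t)

s = np.array([1.0, -1.0])
a_t = np.array([0.4, 0.2])
a = residual_policy(s, a_t)
assert np.allclose(a, [0.1, 0.2])  # a_t - (0.5*a_t + 0.1*s)
```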

4. Empirical Results and Training Dynamics

MeanFlow pretraining delivers high one-step sample quality (FID), fast convergence, and robust performance:

| Model / Setting | Dataset / Size | 1-Step FID ↓ | 2-Step FID ↓ | Notes |
|---|---|---|---|---|
| Modular MeanFlow (MMF) | ImageNet 256×256 | 3.43 | 2.93 | DiT-XL/2 backbone (You et al., 24 Aug 2025) |
| Accelerated MF (DTD) | ImageNet 256×256 | 2.87 | 2.64 | Task affinity + curriculum (Kim et al., 24 Nov 2025) |
| Decoupled MF (DMF-XL/2+) | ImageNet 256×256 | 2.16 | 1.64 | No arch change, Cauchy loss (Lee et al., 28 Oct 2025) |
| Rectified MF (Re-MeanFlow) | ImageNet 256×256 | 3.41 | – | Truncated, 1 reflow (Zhang et al., 28 Nov 2025) |
| MF+RAE (distill) | ImageNet 256×256 | 2.03 | 1.89 | RAE latent, no guidance (Hu et al., 17 Nov 2025) |

Re-MeanFlow on ImageNet 64 shows FID = 2.87 (EDM2-S), matching or outperforming previous few-step flow-based methods. In the RL context, residual MeanFlow policies outperform prior two-stage flow-policy baselines on 65/73 OGBench and D4RL tasks, maintaining near-zero “bound loss” and expressivity for multimodal action distributions (Wang et al., 17 Nov 2025).

Empirical dynamics highlight that accurate instantaneous velocity learning must precede average velocity, and small-gap average velocity supervision stabilizes the progression towards large-jump one-step models (Kim et al., 24 Nov 2025). Failure to respect these curriculum constraints destabilizes or slows convergence.

5. Implementation and Practical Strategies

Key hyperparameters and strategies common to successful MeanFlow pretraining:

  • Batch sizes: typically $128$–$512$ (larger batches recommended during initial warmup) (You et al., 24 Aug 2025, Hu et al., 17 Nov 2025).
  • Learning rates: $10^{-4}$ with cosine decay; Adam or AdamW optimizer (You et al., 24 Aug 2025, Lee et al., 28 Oct 2025, Hu et al., 17 Nov 2025).
  • Exponential moving average (EMA) parameter stabilization is consistently beneficial (e.g., $\beta=0.9999$) (Zhang et al., 28 Nov 2025).
  • Curriculum over interval gap: linearly or progressively increase the gap size seen by the $u_\theta$ head (e.g., $\beta(\Delta t, s)=1-s+\lambda s(1-\Delta t)$) (Kim et al., 24 Nov 2025).
  • For efficiency, Jacobian-vector-products (JVP) can be computed via forward-mode autodiff; finite-difference approximations can stabilize weakly trained teachers in latent-space MeanFlow (Hu et al., 17 Nov 2025).
  • Distance truncation in Re-MeanFlow avoids instability from highly curved rectified paths; 10% discarding is usually effective (Zhang et al., 28 Nov 2025).
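Two of these strategies admit short sketches; the toy $u$ and the step sizes below are illustrative assumptions, not values from the papers:

```python
import numpy as np

# (1) EMA parameter stabilization: ema <- beta*ema + (1-beta)*params.
def ema_update(ema, params, beta=0.9999):
    return {k: beta * ema[k] + (1.0 - beta) * params[k] for k in params}

# (2) Finite-difference approximation of the total derivative
#     d/dt u = dt_u + grad_x u . v,
# i.e. the JVP of u along the direction (v, 1):
#     du/dt ~= (u(x + eps*v, r, t + eps) - u(x, r, t)) / eps.
def fd_total_derivative(u, x, r, t, v, eps=1e-4):
    return (u(x + eps * v, r, t + eps) - u(x, r, t)) / eps

# Toy check with u(x, r, t) = t * x, where exactly d/dt u = x + t * v.
u = lambda x, r, t: t * x
x = np.array([1.0, 2.0])
v = np.ones_like(x)
approx = fd_total_derivative(u, x, 0.0, 0.5, v)
exact = x + 0.5 * v
assert np.allclose(approx, exact, atol=1e-3)
```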

In Decoupled MeanFlow, architectural reuse (encoder/decoder split) and combined FM/MF losses enable direct upgrading of pretrained diffusion models without additional weights or layers (Lee et al., 28 Oct 2025).

6. Applications and Extensions: Image Synthesis, Latent Models, and Policy Learning

MeanFlow pretraining is widely adopted in:

  • Image synthesis: Modular MeanFlow, DMF, Re-MeanFlow, and latent-space MF approaches all achieve state-of-the-art FID in one or few steps on ImageNet and other datasets (You et al., 24 Aug 2025, Lee et al., 28 Oct 2025, Zhang et al., 28 Nov 2025, Hu et al., 17 Nov 2025).
  • Latent generative models: MeanFlow leverages powerful representation autoencoders for semantically meaningful generation, significantly reducing computational requirements compared to SD-VAE pipelines (Hu et al., 17 Nov 2025).
  • Offline RL policy learning: Residual MeanFlow reformulation (one-step Q-learning) enables fast, expressive, stable policy deployment in both tabular and continuous-control settings, outperforming both Gaussian and compositionally distilled flow policies (Wang et al., 17 Nov 2025).

Curriculum-based, modular, and decoupled MeanFlow strategies continue to generalize as foundational techniques for scalable, efficient, and robust generative modeling.

7. Limitations, Pitfalls, and Best Practices

MeanFlow training can exhibit instability due to higher-order gradients unless proper gradient modulation, curriculum scheduling, and architectural choices are applied (You et al., 24 Aug 2025, Kim et al., 24 Nov 2025). Key pitfalls:

  • Jumping directly to full gradient backpropagation ($\lambda=1$) can cause oscillatory, divergent losses.
  • Keeping $\lambda$ small throughout restricts expressivity, leading to poor FID.
  • Instability arises with very small interval gaps $(t-r)$; this can be mitigated by thresholding $t-r$ and appropriately reweighting loss terms (You et al., 24 Aug 2025).
  • In latent-space MF, naive initialization without trajectory-aware warm-start (e.g., CMT) leads to severe gradient explosion (Hu et al., 17 Nov 2025).

Best practices include:

  • Monitoring both the regression loss and gradient norms throughout curriculum ramp-up.
  • Applying distance truncation and adaptive loss scaling as needed.
  • Optionally fine-tuning on long intervals ($r=0$, $t=1$) late in training to polish one-step accuracy.
  • Using adaptive schedules for curriculum and guidance mixing.
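The gradient-norm monitoring in the first practice can be a one-liner; a minimal NumPy sketch:

```python
import numpy as np

# Global L2 gradient norm over a list of per-parameter gradient arrays.
# A spike in this quantity during the lambda ramp-up is an early warning
# of the instabilities described above.
def global_grad_norm(grads):
    return float(np.sqrt(sum(float(np.sum(g * g)) for g in grads)))

grads = [np.array([3.0]), np.array([4.0])]
assert global_grad_norm(grads) == 5.0  # sqrt(9 + 16)
```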

MeanFlow pretraining provides a mathematically principled, empirically robust approach for one-step generative modeling, acceleration of flow-based architectures, and efficient reinforcement learning policy generation—with modular, scalable strategies now established across the generative modeling landscape (You et al., 24 Aug 2025, Kim et al., 24 Nov 2025, Lee et al., 28 Oct 2025, Zhang et al., 28 Nov 2025, Hu et al., 17 Nov 2025, Wang et al., 17 Nov 2025).
