
Modular MeanFlow: Unified One-Step Modeling

Updated 26 November 2025
  • Modular MeanFlow (MMF) is a framework that efficiently generates high-quality data samples in one step via time-averaged velocity regression.
  • It introduces a tunable gradient modulation mechanism with a curriculum warmup to balance training stability and model expressiveness.
  • Empirical results show state-of-the-art performance in image synthesis, low-data regimes, and out-of-distribution scenarios.

Modular MeanFlow (MMF) is a unifying framework for stable and scalable one-step generative modeling, developed to efficiently generate high-quality data samples via direct mapping in a single function evaluation. MMF generalizes and interpolates between flow-matching and consistency-based models by introducing a principled family of regression losses built upon time-averaged velocity fields. Central to its design are a differential identity linking instantaneous and averaged velocities, a tunable gradient modulation mechanism, and a curriculum-style warmup schedule for training stability and expressiveness. Empirically, MMF achieves state-of-the-art performance across image synthesis, low-data, out-of-distribution (OOD), and trajectory modeling tasks, while circumventing the computational burden of higher-order derivatives (You et al., 24 Aug 2025).

1. Theoretical Framework

MMF builds on the continuous-time generative model defined by the ordinary differential equation (ODE):

$$\frac{dx_t}{dt} = v(x_t, t), \qquad x_1 \sim p_\text{prior}, \quad x_0 \sim p_\text{data}$$

where $v(x_t, t)$ denotes the instantaneous velocity field parameterizing the mapping from $p_\text{prior}$ (usually a tractable distribution) to $p_\text{data}$. MMF introduces the time-averaged velocity field over the interval $[r, t]$:

$$u(x_t, r, t) := \frac{1}{t - r} \int_{r}^{t} v(x_\tau, \tau)\,d\tau$$

With Lipschitz assumptions on $v$, the averaged velocity recovers the instantaneous field as $t \to r$:

$$\lim_{t \to r} u(x_t, r, t) = v(x_r, r)$$

A key identity underpins MMF:

$$v(x_t, t) = u(x_t, r, t) + (t - r) \frac{d}{dt} u(x_t, r, t)$$

where $\frac{d}{dt}u = \partial_t u + (\nabla_x u)\,v(x_t, t)$. This relation enables the regression of averaged velocities and their time derivatives to approximate the model's functional path, decoupling expressiveness from the risk of instability intrinsic to higher-order supervision.
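A minimal sketch of this identity in PyTorch (the function and argument names are illustrative, not from the paper): the total derivative $\frac{d}{dt}u$ is obtained with a single forward-mode Jacobian-vector product, so no explicit second-order quantity is ever formed.

```python
import torch
from torch.func import jvp

def total_time_derivative(u_theta, x, r, t, v):
    """Compute u and d/dt u = ∂_t u + (∇_x u) v with one forward-mode JVP.

    u_theta: callable (x, r, t) -> time-averaged velocity u(x, r, t)
    v:       instantaneous velocity dx_t/dt evaluated at (x, t)
    """
    # Tangent direction: x moves with velocity v, r is held fixed, t advances at unit rate.
    tangents = (v, torch.zeros_like(r), torch.ones_like(t))
    u, du_dt = jvp(u_theta, (x, r, t), tangents)
    return u, du_dt
```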

2. Modular Loss Construction and Gradient Modulation

MMF defines a spectrum of regression losses parametrized both by the velocity-averaging interval and by a scalar $\lambda \in [0,1]$ controlling gradient flow. The "full" regression loss is given by:

$$\mathcal{L}_\text{full} = \mathbb{E}_{x_0, x_1, r < t} \left\| u_\theta(x_t, r, t) + (t-r)\Big(\partial_t u_\theta + \nabla_x u_\theta \cdot u_\theta\Big) - \frac{x_1 - x_0}{t - r} \right\|^2$$

To trade off training stability and functional expressiveness, a partial stop-gradient operator is introduced:

$$\mathrm{SG}_\lambda[z] := \lambda z + (1-\lambda)\,\mathrm{stopgrad}(z)$$

The MMF loss then generalizes as:

$$\mathcal{L}_\lambda = \mathbb{E}_{x_0, x_1, r < t} \left\| u_\theta(x_t, r, t) + (t-r)\,\mathrm{SG}_\lambda\left[\partial_t u_\theta + \nabla_x u_\theta \cdot \frac{x_1 - x_0}{t-r}\right] - \frac{x_1 - x_0}{t - r} \right\|^2$$

Here, $\lambda=1$ fully propagates gradients (maximum expressiveness but possible instability), $\lambda=0$ detaches Jacobian-vector products (maximum stability), and intermediate $\lambda$ governs a stability-expressiveness continuum. Explicitly blocking gradient flow through higher-order terms prevents gradient explosions and training oscillations.
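The following sketch (assuming PyTorch and a callable u_theta(x, r, t); all names are illustrative) puts the pieces together: the bracketed JVP term is computed by forward-mode autodiff and then passed through the partial stop-gradient $\mathrm{SG}_\lambda$ before entering the squared residual.

```python
import torch
from torch.func import jvp

def mmf_loss(u_theta, x0, x1, r, t, lam):
    """Sketch of the modular MeanFlow loss L_lambda.

    u_theta: callable (x, r, t) -> predicted time-averaged velocity
    x0, x1:  data and prior samples, shape (B, ...)
    r, t:    sampled times with r < t, broadcastable against x0
    lam:     gradient-modulation coefficient lambda in [0, 1]
    """
    alpha = (t - r) / (1 - r)              # interpolation weight from the training protocol below
    xt = (1 - alpha) * x0 + alpha * x1     # point on the path at time t
    target = (x1 - x0) / (t - r)           # displacement-based regression target

    # Forward-mode JVP gives ∂_t u + ∇_x u · target in a single pass.
    u, du_dt = jvp(u_theta, (xt, r, t),
                   (target, torch.zeros_like(r), torch.ones_like(t)))

    # Partial stop-gradient: SG_lambda[z] = lam * z + (1 - lam) * z.detach()
    sg_du_dt = lam * du_dt + (1 - lam) * du_dt.detach()

    residual = u + (t - r) * sg_du_dt - target
    return residual.pow(2).mean()
```

Setting lam=0 detaches the JVP term entirely (maximum stability), while lam=1 backpropagates through it in full (maximum expressiveness), mirroring the continuum described above.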

3. Curriculum Warmup and Training Protocol

MMF adopts a curriculum for $\lambda$:

$$\lambda(t_\mathrm{train}) = \min\left(1, \frac{t_\mathrm{train}}{T_\mathrm{warmup}}\right), \qquad T_\mathrm{warmup} \approx 10\%\ \text{of total steps}$$

In early training ($\lambda \approx 0$), MMF behaves as a consistency or flow-matching model, yielding high stability by restricting second-order signal propagation. As $\lambda \to 1$, expressive gradients are introduced, allowing the model to capture richer curvature and achieve lower asymptotic loss. Empirically, this curriculum schedule yields both rapid convergence and low variance in training.
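In code, the schedule is a one-line linear ramp (a sketch; the argument names are illustrative):

```python
def curriculum_lambda(step: int, warmup_steps: int) -> float:
    """Linear warmup: lambda rises from 0 to 1 over warmup_steps, then stays at 1."""
    return min(1.0, step / warmup_steps)
```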

A typical MMF training protocol:

Given: network u_θ, total steps N, warmup T_warmup
for step = 1 to N do
    sample x₀ ∼ p_data,  x₁ ∼ p_prior
    sample r ∼ U[0,1), t ∼ U(r,1]
    α ← (t - r)/(1 - r)
    xₜ ← (1 - α)·x₀ + α·x₁
    λ ← min(1, step / T_warmup)
    # compute loss ℒ_λ using SG_λ on the JVP term
    L ← ‖ u_θ(xₜ, r, t)
          + (t - r)·SG_λ[∂ₜu_θ + ∇ₓu_θ·((x₁ - x₀)/(t - r))]
          - (x₁ - x₀)/(t - r) ‖²
    θ ← θ - AdamStep(∇_θ L)
end for

Sampling proceeds in one step: $\hat{x}_0 = x_1 - u_\theta(x_1, r{=}0, t{=}1)$.
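A one-step sampler under the same assumptions as the sketches above (PyTorch; a standard Gaussian prior is assumed, and the time arguments are shaped as (B, 1) columns to match the loss sketch):

```python
import torch

@torch.no_grad()
def sample_one_step(u_theta, batch_size, dim, device="cpu"):
    """Draw x_1 from the prior and subtract the averaged velocity over [0, 1]."""
    x1 = torch.randn(batch_size, dim, device=device)   # x_1 ~ p_prior (assumed Gaussian)
    r = torch.zeros(batch_size, 1, device=device)
    t = torch.ones(batch_size, 1, device=device)
    return x1 - u_theta(x1, r, t)                       # x_hat_0 = x_1 - u_theta(x_1, 0, 1)
```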

4. Connections to Prior Methods

MMF subsumes prior consistency and flow-matching models as special cases:

  • Consistency Models: Fixing $(r, t) \equiv (0, 1)$ and $\lambda=0$ recovers the fixed-time consistency loss $\|u_\theta(x_t, 0, 1) - (x_1 - x_0)\|^2$.
  • Flow Matching: The instantaneous limit $t \to r$, with $u_\theta \to v_\theta$ and $(t-r)^{-1}(x_1 - x_0) \to v(x_r, r)$, recovers the flow-matching loss $\|v_\theta(x_r, r) - v_\mathrm{true}(x_r, r)\|^2$.
  • Gradient Efficiency: Applying the stop-gradient to the Jacobian-vector term ensures that backward computation never traverses $\nabla^2 u_\theta$, eliminating the $\mathcal{O}(d^2)$ cost and Hessian-vector product overhead.

This unification allows MMF to inherit the interpretability and theoretical properties of both frameworks, while providing a tunable control for interpolation between them.
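As a usage illustration (reusing the hypothetical mmf_loss sketch from Section 2, with placeholder tensors and a placeholder network), the consistency-model special case corresponds to pinning the endpoints and switching off gradient flow through the JVP term:

```python
import torch

B, D = 8, 2
x0 = torch.randn(B, D)               # stand-in for x_0 ~ p_data
x1 = torch.randn(B, D)               # stand-in for x_1 ~ p_prior
u_theta = lambda x, r, t: 0.5 * x    # placeholder network; any (x, r, t) -> R^D callable works

# (r, t) = (0, 1) and lambda = 0: the consistency-model special case listed above.
r = torch.zeros(B, 1)
t = torch.ones(B, 1)
loss = mmf_loss(u_theta, x0, x1, r, t, lam=0.0)
```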

5. Empirical Results and Model Analysis

MMF's empirical evaluation focuses on image synthesis, robustness, and path modeling:

| Model | FID (↓) | 1-step MSE (↓) | LPIPS (↓) | Inference Time (s) (↓) |
|---|---|---|---|---|
| MeanFlow (full) | 3.91 | 0.087 | 0.132 | 0.031 |
| MeanFlow (stop-grad) | 4.27 | 0.095 | 0.156 | 0.024 |
| MMF (λ=0) | 4.19 | 0.093 | 0.148 | 0.023 |
| MMF (λ=0.5) | 3.78 | 0.084 | 0.120 | 0.026 |
| MMF (λ=1) | 3.62 | 0.080 | 0.109 | 0.034 |
| MMF (curriculum) | 3.41 | 0.076 | 0.097 | 0.025 |

On CIFAR-10 and ImageNet-64, curriculum MMF achieves the lowest FID, lowest 1-step MSE, and best LPIPS, matching or exceeding the efficiency of prior mean-flow and consistency baselines. Few-shot and OOD experiments show that curriculum MMF retains low FID even with as little as 1% of CIFAR-10 data and achieves 10–20% lower FID in OOD settings (SVHN, STL-10, CIFAR-C) compared to baselines. In ODE-fitting and 2D control tasks, curriculum MMF yields smooth, accurate paths, outperforming noisy full-gradient and oversmoothed stop-gradient alternatives.

Path deviation is formalized as:

$$\mathcal{D}_\mathrm{path} = \mathbb{E}\big\| (s - r)\, u(x_s, r, s) + (t - s)\, u(x_t, s, t) - (t - r)\, u(x_t, r, t) \big\|$$

Curriculum MMF achieves the lowest $\mathcal{D}_\mathrm{path}$, supporting latent interpolation smoothness.
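A sketch of estimating this quantity for a batch of split points $r < s < t$ (names are illustrative; u_theta is the same kind of callable as in the earlier sketches):

```python
import torch

def path_deviation(u_theta, x_s, x_t, r, s, t):
    """Monte Carlo estimate of D_path: deviation of averaged velocities from
    additivity over the split [r, s] + [s, t] versus the full interval [r, t]."""
    lhs = (s - r) * u_theta(x_s, r, s) + (t - s) * u_theta(x_t, s, t)
    rhs = (t - r) * u_theta(x_t, r, t)
    # Per-sample norm over feature dimensions, then mean over the batch.
    return (lhs - rhs).flatten(1).norm(dim=1).mean()
```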

6. Ablations and Practicalities

Extensive ablations reveal that:

  • Varying $\lambda$: $\lambda=0$ yields maximum stability but higher FID (underfitting curvature). $\lambda=1$ is most expressive but unstable (loss oscillations). $\lambda=0.5$ provides some smoothing but exhibits late-stage variance. The curriculum $\lambda(t)$ combines low early variance with the best final performance.
  • Curriculum Horizon: A short warmup (small $T_\mathrm{warmup}$) induces early instability, while an overly long warmup is too conservative and slows convergence. The optimum is $T_\mathrm{warmup} \approx 10$–$15\%$ of total steps.
  • Compute: Forward-mode autodiff for the Jacobian-vector product adds roughly 15% overhead; with the stop-gradient applied, no backward pass is needed through this term.

The standard MMF implementation uses a UNet architecture with sinusoidal time embeddings, the Adam optimizer (learning rate $1 \times 10^{-4}$, batch size 128, cosine decay), and a curriculum warmup over 100k steps.
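A hedged end-to-end sketch of this setup (the tiny network, toy data, and step count are placeholders; only the Adam learning rate, batch size, cosine decay, and warmup fraction come from the text; mmf_loss is the sketch from Section 2):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy stand-in for the UNet with sinusoidal time embeddings described above."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, hidden), nn.SiLU(), nn.Linear(hidden, dim))
    def forward(self, x, r, t):
        return self.net(torch.cat([x, r, t], dim=-1))

model = TinyNet(dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # lr 1e-4 as reported
total_steps = 1000                                            # illustrative; real runs are far longer
warmup_steps = max(1, total_steps // 10)                      # ~10% curriculum warmup
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    x0 = torch.randn(128, 2)                                  # toy stand-in for a data batch (batch size 128)
    x1 = torch.randn(128, 2)                                  # prior samples
    r = torch.rand(128, 1)
    t = r + (1 - r) * torch.rand(128, 1).clamp_min(1e-3)      # ensures r < t <= 1
    lam = min(1.0, (step + 1) / warmup_steps)                 # curriculum lambda
    loss = mmf_loss(model, x0, x1, r, t, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```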

7. Significance, Limitations, and Outlook

MMF provides a theoretically grounded, computationally efficient, and practically robust approach for one-step generative modeling. By enabling a tunable spectrum between expressiveness and stability—mediated by gradient modulation and curriculum scheduling—it addresses the instability and inefficiency intrinsic to prior higher-order methods. MMF’s empirical results demonstrate high generalization under low-data and out-of-distribution regimes and applicability beyond image synthesis to trajectory modeling. A plausible implication is that MMF may be extensible to other domains requiring stable, one-shot sampling of complex data distributions via learnable ODE flows (You et al., 24 Aug 2025).
