Modular MeanFlow: Unified Generative Modeling
- Modular MeanFlow is a unified framework that employs time-averaged velocity fields to achieve efficient, one-step generative modeling across diverse applications.
- It generalizes flow-matching and consistency models by integrating noise-to-data transformations with residual reformulations for scalable architecture design.
- Empirical findings demonstrate improved computational efficiency and stability, with competitive performance in RL tasks, image synthesis, and neural architecture compression.
Modular MeanFlow is a unified framework for efficient, expressive, and stable one-step generative modeling, grounded in the theory of time-averaged velocity fields. It generalizes prior flow-matching and consistency models, supports scalable architectures, and applies to offline reinforcement learning, generative image modeling, and neural architecture compression. At its core, Modular MeanFlow integrates noise-to-data transformations and the averaging of velocity fields to realize single-pass sample generation, markedly improving the computational efficiency and stability of deep generative models.
1. Mathematical Foundations and Policy Definition
Modular MeanFlow models the transformation from a simple prior distribution (e.g., Gaussian noise) to complex target distributions (data or actions) in a single function evaluation. The foundational object is the time-averaged velocity field

$$u(z_t, r, t) = \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau,$$

where $v(z_\tau, \tau)$ is the instantaneous velocity field parameterizing ODE trajectories between noise and data. The one-step generative policy is defined as

$$\hat{x} = z_1 - u_\theta(z_1, 0, 1), \qquad z_1 \sim \mathcal{N}(0, I),$$

for image/data generation (You et al., 24 Aug 2025), and in RL,

$$a = g_\theta(e, b{=}0, t{=}1 \mid s), \qquad e \sim \mathcal{N}(0, I),$$

where $g_\theta$ is the modular policy network (Wang et al., 17 Nov 2025). For architectures like ResNet, a MeanFlow module replaces multi-step residual blocks with a single transformation:

$$X_{\mathrm{out}} = Z_{\mathrm{align}} - u_\theta(Z_{\mathrm{align}}, t),$$

where $Z_{\mathrm{align}}$ is the stage-aligned feature representation (Sun et al., 16 Nov 2025).
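A minimal sketch of one-step sampling under these definitions (PyTorch-style; the unconditional setting, the network signature `u_theta(z, r, t)`, and the batch layout are assumptions for illustration, not code from the cited papers):

```python
import torch

@torch.no_grad()
def sample_one_step(u_theta, shape, device="cpu"):
    """Draw noise z1 and map it to data in a single evaluation of the
    time-averaged velocity field over the interval [0, 1]."""
    z1 = torch.randn(shape, device=device)      # prior sample z_1 ~ N(0, I)
    r = torch.zeros(shape[0], device=device)    # start time r = 0
    t = torch.ones(shape[0], device=device)     # end time t = 1
    # x_hat = z_1 - (t - r) * u_theta(z_1, r, t); with r = 0, t = 1 this
    # reduces to z_1 - u_theta(z_1, 0, 1).
    return z1 - u_theta(z1, r, t)
```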
2. Residual Reformulation and Differential Identities
A central innovation in Modular MeanFlow is the use of residual reformulations to collapse multi-stage processes (iterative flows or two-stage models) into a single residual network. The differential identity linking instantaneous (local) and average (global) velocity is

$$u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt}u(z_t, r, t),$$

with total derivative

$$\frac{d}{dt}u(z_t, r, t) = v(z_t, t)\,\partial_z u(z_t, r, t) + \partial_t u(z_t, r, t).$$

For RL, substituting the interpolated action $a_t = (1-t)a + te$ and conditional velocity $v = e - a$ yields a regression target for $g_\theta$:

$$g_{\mathrm{tgt}} = a_t + (t - b - 1)\,v - (t - b)\,\frac{d}{dt}g_\theta(s, a_t, b, t),$$

enforced by minimizing

$$\mathcal{L}_{\mathrm{MFI}} = \big\| g_\theta(s, a_t, b, t) - \mathrm{sg}\!\left[g_{\mathrm{tgt}}\right] \big\|^2,$$

circumventing direct estimation of the averaged-velocity integral (Wang et al., 17 Nov 2025).
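A hedged, runnable sketch of this regression in PyTorch, using forward-mode `torch.func.jvp` for the total derivative; the policy signature `g_theta(s, a_t, b, t)` and the batching conventions are assumptions layered on the pseudocode of Section 3:

```python
import torch
from torch.func import jvp

def mfi_loss(g_theta, s, a):
    """MeanFlow-identity regression for a one-step policy g_theta:
    regress g_theta(s, a_t, b, t) onto the stop-gradient target
    a_t + (t - b - 1) v - (t - b) dg/dt, with v = e - a."""
    e = torch.randn_like(a)                                # noise endpoint
    t = torch.rand(a.shape[0], device=a.device)            # t ~ U(0, 1)
    b = torch.rand(a.shape[0], device=a.device) * t        # b ~ U(0, t)
    tb, bb = t.unsqueeze(-1), b.unsqueeze(-1)
    a_t = (1 - tb) * a + tb * e                            # interpolated action
    v = e - a                                              # conditional velocity
    # Forward-mode JVP along (da_t/dt, db/dt, dt/dt) = (v, 0, 1)
    g, dgdt = jvp(lambda a_, b_, t_: g_theta(s, a_, b_, t_),
                  (a_t, b, t),
                  (v, torch.zeros_like(b), torch.ones_like(t)))
    g_tgt = a_t + (tb - bb - 1) * v - (tb - bb) * dgdt
    return ((g - g_tgt.detach()) ** 2).mean()
```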
For generative modeling, Modular MeanFlow introduces a gradient modulation mechanism on the total-derivative term:

$$\left.\frac{d}{dt}u_\theta\right|_{\mathrm{mod}} = \lambda\,\frac{d}{dt}u_\theta + (1-\lambda)\,\mathrm{sg}\!\left[\frac{d}{dt}u_\theta\right],$$

where $\lambda \in [0,1]$ interpolates between the stop-gradient and full-gradient regimes, with $\lambda$ progressively increased using a curriculum schedule, balancing stability and expressiveness (You et al., 24 Aug 2025).
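A short sketch of how this modulation could appear in training code (the helper name is illustrative; `dudt` denotes the JVP output from the loss computation):

```python
def modulate_dudt(dudt, lam):
    """Blend full-gradient and stop-gradient versions of the total derivative:
    lam = 0 recovers the detached (consistency-style) target, while lam = 1
    lets gradients flow through the Jacobian-vector product."""
    return lam * dudt + (1.0 - lam) * dudt.detach()
```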
3. Algorithmic Procedures and Pseudocode
Modular MeanFlow algorithms consist of a phased training loop, modular architectural instantiation, and loss/curriculum strategies:
RL Training Loop (One-Step Policy with Q-Learning) (Wang et al., 17 Nov 2025):
    Initialize critic Q_ϕ, target critic Q̄_ϕ, policy network g_θ
    while not converged:
        (s, a, r, s') ← UniformSample(𝒟)
        for k = 1..K:
            e_k ∼ N(0, I)
            a'_k = g_θ(e_k, b=0, t=1 | s')
        a' ← argmax_k Q̄_ϕ(s', a'_k)
        TD_target = r + γ·Q̄_ϕ(s', a')
        ϕ ← ϕ − α_Q ∇_ϕ [Q_ϕ(s, a) − TD_target]^2
        t ∼ U(0, 1), b ∼ U(0, t)
        e ∼ N(0, I), a_t = (1−t)a + t e
        v = e − a
        (g, dgdt) = jvp(g_θ, (s, a_t, b, t), (s, v, 0, 1))
        g_tgt = a_t + (t−b−1)v − (t−b)dgdt
        L_MFI = ‖g − stopgrad(g_tgt)‖^2
        e₀ ∼ N(0, I); a^π = g_θ(e₀, 0, 1 | s)
        L_Qπ = −Q_ϕ(s, a^π)
        θ ← θ − α_θ ∇_θ [L_Qπ + α·L_MFI]
        ϕ̄ ← τϕ + (1−τ)ϕ̄
    end while
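The critic update above selects, for each next state, the best of K candidate actions proposed by the one-step policy. A hedged PyTorch sketch of that step (the argument order of `g_theta`, the `action_dim` parameter, and the batching are assumptions, not the authors' code):

```python
import torch

@torch.no_grad()
def best_of_k_action(g_theta, q_bar, s_next, action_dim, K=8):
    """Sample K candidate actions from the one-step policy for each next
    state and keep the one the target critic scores highest."""
    B = s_next.shape[0]
    s_rep = s_next.repeat_interleave(K, dim=0)              # (B*K, obs_dim)
    e = torch.randn(B * K, action_dim, device=s_next.device)
    b = torch.zeros(B * K, device=s_next.device)            # b = 0
    t = torch.ones(B * K, device=s_next.device)             # t = 1
    a_cand = g_theta(s_rep, e, b, t)                        # candidate actions
    q_vals = q_bar(s_rep, a_cand).view(B, K)                # target-critic scores
    best = q_vals.argmax(dim=1)                             # best index per state
    return a_cand.view(B, K, -1)[torch.arange(B), best]
```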
Image Modeling and Meta-Architecture Pseudocode (You et al., 24 Aug 2025, Sun et al., 16 Nov 2025):
    import torch.nn.functional as F

    def MeanFlowModule(X_in, Conv_align, BatchNorm, u_theta, t=(0.0, 1.0)):
        # Stage alignment: 1x1 convolution + batch norm + ReLU
        Z_align = F.relu(BatchNorm(Conv_align(X_in)))
        B, C, H, W = Z_align.shape
        # Flatten spatial positions so u_theta acts on per-location feature vectors
        Z = Z_align.permute(0, 2, 3, 1).reshape(B * H * W, C)
        U = u_theta(Z, t)            # time-averaged velocity over the interval t
        Z_mapped = Z - U             # single-step MeanFlow displacement
        X_out = Z_mapped.view(B, H, W, C).permute(0, 3, 1, 2)
        return X_out
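A toy usage sketch for the module above (channel sizes and the `u_theta` stand-in are illustrative; the 2-layer GeLU MLP omits the time embedding for brevity):

```python
import torch
import torch.nn as nn

conv_align = nn.Conv2d(64, 128, kernel_size=1)      # 1x1 alignment convolution
batch_norm = nn.BatchNorm2d(128)
mlp = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))
u_theta = lambda z, t: mlp(z)                       # toy velocity network, ignores t

x_in = torch.randn(4, 64, 32, 32)
x_out = MeanFlowModule(x_in, conv_align, batch_norm, u_theta)
print(x_out.shape)                                  # torch.Size([4, 128, 32, 32])
```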
Curriculum scheduling for $\lambda$ enables stable gradient propagation during training (You et al., 24 Aug 2025).
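A minimal linear-warmup schedule consistent with this description (the horizon and maximum value are illustrative defaults, not reported settings):

```python
def lambda_schedule(step, warmup_steps=10_000, lam_max=1.0):
    """Linearly increase the gradient-modulation coefficient lambda from 0
    to lam_max over warmup_steps, then hold it fixed."""
    return min(lam_max, lam_max * step / warmup_steps)
```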
4. Variant-Level Design and Modularity
Modular MeanFlow generalizes prior architectures by allowing flexible instantiation of key components: noise generator, fixed module, and residual learner.
- Residual forms determine how the learned residual is combined with the noise input and the fixed module. Empirical studies confirm that the MeanFlow residual parameterization yields stable and expressive decoders, whereas naive residual choices cause mode collapse and elevated out-of-bounds action rates (Wang et al., 17 Nov 2025).
- This modularity enables drop-in replacement or compression of multi-block stages (e.g., ResNet, DenseNet, Transformer encoder groups) by single-step MeanFlow modules, as sketched after this list (Sun et al., 16 Nov 2025).
- Compression–expansion strategies retain discriminative capacity (by selectively incubating critical early blocks) while achieving parameter efficiency.
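A hedged sketch of such a drop-in replacement, reusing the MeanFlowModule from Section 3 and assuming a torchvision-style ResNet-50 with stage attributes layer1–layer4 (the choice of stage and the wrapper class are illustrative, not the exact MFI-ResNet construction):

```python
import torch.nn as nn
from torchvision.models import resnet50

class MeanFlowStage(nn.Module):
    """Wraps MeanFlowModule so it can stand in for a multi-block residual stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # stride-2 alignment mirrors the replaced stage's spatial downsampling
        self.conv_align = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.mlp = nn.Sequential(nn.Linear(out_ch, 2 * out_ch), nn.GELU(),
                                 nn.Linear(2 * out_ch, out_ch))

    def forward(self, x):
        return MeanFlowModule(x, self.conv_align, self.bn, lambda z, t: self.mlp(z))

model = resnet50(weights=None)
# Replace the third multi-block stage (512 -> 1024 channels in ResNet-50) with a
# single-step MeanFlow stage; earlier, critical blocks are kept ("incubated").
model.layer3 = MeanFlowStage(in_ch=512, out_ch=1024)
```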
5. Architectural Configurations and Hyperparameters
Key settings for Modular MeanFlow in various domains include:
| Feature | RL (MeanFlow Policy) | Vision (MFI-ResNet) | Generative Modeling (MMF) |
|---|---|---|---|
| Policy / Velocity Network | DiT-style Transformer (depth=3, hidden=256, heads=2) | 2-layer MLP with GeLU, time embedding | UNet with residual blocks |
| Critic Network | 4-layer MLP, width 512 | N/A | N/A |
| Learning Rates | Cosine warmup | N/A | Adam |
| α (behavior cloning weight) | Adaptive | N/A | N/A |
| JVP Usage | Forward-mode autodiff | N/A | λ>0 only |
| Curriculum Schedule | N/A | N/A | Linear λ warmup |
| Data Batch Sizes | N/A | N/A | 128 |
| Noise Dimension | Action dimension | N/A | Data dimension |
- For meta-architectures, 1×1 convolution, batch normalization, and ReLU are used for dimensional alignment before velocity prediction (Sun et al., 16 Nov 2025).
- Warmup horizons and curriculum schedules are employed for stability in generative modeling (You et al., 24 Aug 2025).
6. Empirical Findings and Analysis
Modular MeanFlow achieves strong empirical performance and stability across domains.
- In RL, Modular MeanFlow matches or outperforms ten baselines (Gaussian, diffusion, flow) on 68/73 tasks, e.g., OGBench antmaze-large-singletask (81% vs FQL’s 79%), humanoidmaze-large (20% vs FQL’s 4%), puzzle-3x3 (66% vs FQL’s 30%). Offline→online fine-tuning achieves 82→100% on humanoidmaze, 62→100% on antsoccer (Wang et al., 17 Nov 2025).
- In ResNet compression, MFI-ResNet-50 reduces parameters by 46.3%, with marginal increase in accuracy (CIFAR-10: 95.56% vs 95.34%; CIFAR-100: 75.93% vs 75.80%) (Sun et al., 16 Nov 2025).
- For image synthesis, curriculum-trained MMF attains the lowest FID (3.41) and 1-step MSE, with robust convergence on CIFAR-10 and strong generalization in low-data and out-of-distribution regimes. Inference cost remains low (∼0.02–0.03 s/image) (You et al., 24 Aug 2025).
Ablation studies demonstrate:
- Nonlinearities in the residual module and time step discretization strategies are crucial for stability.
- Naive residual forms or improper behavior-cloning loss weighting cause collapse or unstable Q-targets.
- Curriculum-style gradient modulation improves stability and sample quality over fixed $\lambda$ (You et al., 24 Aug 2025).
7. Significance, Limitations, and Future Directions
Modular MeanFlow provides a principled, scalable approach to one-step generative modeling and policy learning. Its main advantages are:
- Unified framework generalizing consistency models and flow-matching approaches.
- Modular design allows flexible adaptation to various architectures and tasks.
- Single-evaluation generative sampling reduces computation time.
- Gradient modulation and curriculum warmup balance learning stability and expressiveness.
Limitations include the overhead of Jacobian-vector products (JVPs) in full-gradient regimes and the heuristic nature of curriculum schedules. Scaling to higher data resolutions, developing adaptive scheduling, and further theoretical analysis of generalization error remain open (You et al., 24 Aug 2025). Extension to hybrid one-plus-few-step schemes may offer further speed-fidelity trade-offs.
In summary, Modular MeanFlow delivers state-of-the-art results in generative modeling, offline RL, and neural architecture optimization via unified residual velocity fields and modular loss/objective design (You et al., 24 Aug 2025, Wang et al., 17 Nov 2025, Sun et al., 16 Nov 2025).