Modular MeanFlow: Unified Generative Modeling
- Modular MeanFlow is a unified framework that employs time-averaged velocity fields to achieve efficient, one-step generative modeling across diverse applications.
- It generalizes flow-matching and consistency models by integrating noise-to-data transformations with residual reformulations for scalable architecture design.
- Empirical findings demonstrate improved computational efficiency and stability, with competitive performance in RL tasks, image synthesis, and neural architecture compression.
Modular MeanFlow is a unified framework for efficient, expressive, and stable one-step generative modeling, grounded in the theory of time-averaged velocity fields. It generalizes prior flow-matching and consistency models, supports scalable architectures, and applies to offline reinforcement learning, generative image modeling, and neural architecture compression. At its core, Modular MeanFlow integrates noise-to-data transformations and the averaging of velocity fields to realize single-pass sample generation, markedly improving the computational efficiency and stability of deep generative models.
1. Mathematical Foundations and Policy Definition
Modular MeanFlow models the transformation from a simple prior distribution (e.g., Gaussian noise) to complex target distributions (data or actions) in a single function evaluation. The foundational object is the time-averaged velocity field

$$u(z_t, r, t) = \frac{1}{t-r}\int_r^t v(z_\tau, \tau)\,d\tau,$$

where $v(z_\tau, \tau)$ is the instantaneous velocity field parameterizing ODE trajectories between noise and data. The one-step generative policy is defined as

$$\hat{x} = z_1 - u_\theta(z_1, 0, 1), \qquad z_1 \sim \mathcal{N}(0, I),$$

for image/data generation (You et al., 24 Aug 2025), and in RL,

$$a = g_\theta(e, b{=}0, t{=}1 \mid s), \qquad e \sim \mathcal{N}(0, I),$$

where $g_\theta$ is the modular policy network (Wang et al., 17 Nov 2025). For architectures like ResNet, a MeanFlow module replaces multi-step residual blocks with a single transformation:

$$X_{\mathrm{out}} = Z_{\mathrm{align}} - u_\theta(Z_{\mathrm{align}}, t),$$

where $Z_{\mathrm{align}}$ is the stage-aligned feature representation (Sun et al., 16 Nov 2025).
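A minimal sketch of one-step sampling under these definitions (PyTorch-style; the unconditional setting, the network signature `u_theta(z, r, t)`, and the batch layout are assumptions for illustration, not code from the cited papers):

```python
import torch

@torch.no_grad()
def sample_one_step(u_theta, shape, device="cpu"):
    """Draw noise z1 and map it to data in a single evaluation of the
    time-averaged velocity field over the interval [0, 1]."""
    z1 = torch.randn(shape, device=device)      # prior sample z_1 ~ N(0, I)
    r = torch.zeros(shape[0], device=device)    # start time r = 0
    t = torch.ones(shape[0], device=device)     # end time t = 1
    # x_hat = z_1 - (t - r) * u_theta(z_1, r, t); with r = 0, t = 1 this
    # reduces to z_1 - u_theta(z_1, 0, 1).
    return z1 - u_theta(z1, r, t)
```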
2. Residual Reformulation and Differential Identities
A central innovation in Modular MeanFlow is the use of residual reformulations to collapse multi-stage processes (iterative flows or two-stage models) into a single residual network. The differential identity linking instantaneous (local) and average (global) velocity is

$$u(z_t, r, t) = v(z_t, t) - (t - r)\,\frac{d}{dt}u(z_t, r, t),$$

with total derivative

$$\frac{d}{dt}u(z_t, r, t) = v(z_t, t)\,\partial_z u(z_t, r, t) + \partial_t u(z_t, r, t).$$

For RL, substituting the interpolated action $a_t = (1-t)a + te$ and conditional velocity $v = e - a$ yields a regression target for $g_\theta$:

$$g_{\mathrm{tgt}} = a_t + (t - b - 1)\,v - (t - b)\,\frac{d}{dt}g_\theta(s, a_t, b, t),$$

enforced by minimizing

$$\mathcal{L}_{\mathrm{MFI}} = \big\| g_\theta(s, a_t, b, t) - \mathrm{sg}\!\left[g_{\mathrm{tgt}}\right] \big\|^2,$$

circumventing direct estimation of the averaged-velocity integral (Wang et al., 17 Nov 2025).
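A hedged, runnable sketch of this regression in PyTorch, using forward-mode `torch.func.jvp` for the total derivative; the policy signature `g_theta(s, a_t, b, t)` and the batching conventions are assumptions layered on the pseudocode of Section 3:

```python
import torch
from torch.func import jvp

def mfi_loss(g_theta, s, a):
    """MeanFlow-identity regression for a one-step policy g_theta:
    regress g_theta(s, a_t, b, t) onto the stop-gradient target
    a_t + (t - b - 1) v - (t - b) dg/dt, with v = e - a."""
    e = torch.randn_like(a)                                # noise endpoint
    t = torch.rand(a.shape[0], device=a.device)            # t ~ U(0, 1)
    b = torch.rand(a.shape[0], device=a.device) * t        # b ~ U(0, t)
    tb, bb = t.unsqueeze(-1), b.unsqueeze(-1)
    a_t = (1 - tb) * a + tb * e                            # interpolated action
    v = e - a                                              # conditional velocity
    # Forward-mode JVP along (da_t/dt, db/dt, dt/dt) = (v, 0, 1)
    g, dgdt = jvp(lambda a_, b_, t_: g_theta(s, a_, b_, t_),
                  (a_t, b, t),
                  (v, torch.zeros_like(b), torch.ones_like(t)))
    g_tgt = a_t + (tb - bb - 1) * v - (tb - bb) * dgdt
    return ((g - g_tgt.detach()) ** 2).mean()
```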
For generative modeling, Modular MeanFlow introduces a gradient modulation mechanism on the total-derivative term:

$$\left.\frac{d}{dt}u_\theta\right|_{\mathrm{mod}} = \lambda\,\frac{d}{dt}u_\theta + (1-\lambda)\,\mathrm{sg}\!\left[\frac{d}{dt}u_\theta\right],$$

where $\lambda \in [0,1]$ interpolates between the stop-gradient and full-gradient regimes, with $\lambda$ progressively increased using a curriculum schedule, balancing stability and expressiveness (You et al., 24 Aug 2025).
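A short sketch of how this modulation could appear in training code (the helper name is illustrative; `dudt` denotes the JVP output from the loss computation):

```python
def modulate_dudt(dudt, lam):
    """Blend full-gradient and stop-gradient versions of the total derivative:
    lam = 0 recovers the detached (consistency-style) target, while lam = 1
    lets gradients flow through the Jacobian-vector product."""
    return lam * dudt + (1.0 - lam) * dudt.detach()
```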
3. Algorithmic Procedures and Pseudocode
Modular MeanFlow algorithms consist of a phased training loop, modular architectural instantiation, and loss/curriculum strategies:
RL Training Loop (One-Step Policy with Q-Learning) (Wang et al., 17 Nov 2025):
    Initialize critic Q_ϕ, target critic Q̄_ϕ, policy network g_θ
    while not converged:
        (s, a, r, s') ← UniformSample(𝒟)
        for k = 1..K:
            e_k ∼ N(0, I)
            a'_k = g_θ(e_k, b=0, t=1 | s')
        a' ← argmax_k Q̄_ϕ(s', a'_k)
        TD_target = r + γ·Q̄_ϕ(s', a')
        ϕ ← ϕ − α_Q ∇_ϕ [Q_ϕ(s, a) − TD_target]^2
        t ∼ U(0, 1), b ∼ U(0, t)
        e ∼ N(0, I), a_t = (1−t)a + t e
        v = e − a
        (g, dgdt) = jvp(g_θ, (s, a_t, b, t), (s, v, 0, 1))
        g_tgt = a_t + (t−b−1)v − (t−b)dgdt
        L_MFI = ‖g − stopgrad(g_tgt)‖^2
        e₀ ∼ N(0, I); a^π = g_θ(e₀, 0, 1 | s)
        L_Qπ = −Q_ϕ(s, a^π)
        θ ← θ − α_θ ∇_θ [L_Qπ + α·L_MFI]
        ϕ̄ ← τϕ + (1−τ)ϕ̄
    end while
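The critic update above selects, for each next state, the best of K candidate actions proposed by the one-step policy. A hedged PyTorch sketch of that step (the argument order of `g_theta`, the `action_dim` parameter, and the batching are assumptions, not the authors' code):

```python
import torch

@torch.no_grad()
def best_of_k_action(g_theta, q_bar, s_next, action_dim, K=8):
    """Sample K candidate actions from the one-step policy for each next
    state and keep the one the target critic scores highest."""
    B = s_next.shape[0]
    s_rep = s_next.repeat_interleave(K, dim=0)              # (B*K, obs_dim)
    e = torch.randn(B * K, action_dim, device=s_next.device)
    b = torch.zeros(B * K, device=s_next.device)            # b = 0
    t = torch.ones(B * K, device=s_next.device)             # t = 1
    a_cand = g_theta(s_rep, e, b, t)                        # candidate actions
    q_vals = q_bar(s_rep, a_cand).view(B, K)                # target-critic scores
    best = q_vals.argmax(dim=1)                             # best index per state
    return a_cand.view(B, K, -1)[torch.arange(B), best]
```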
Image Modeling and Meta-Architecture Pseudocode (You et al., 24 Aug 2025, Sun et al., 16 Nov 2025):
    import torch.nn.functional as F

    def MeanFlowModule(X_in, Conv_align, BatchNorm, u_theta, t=(0.0, 1.0)):
        # Stage alignment: 1x1 convolution + batch norm + ReLU
        Z_align = F.relu(BatchNorm(Conv_align(X_in)))
        B, C, H, W = Z_align.shape
        # Flatten spatial positions so u_theta acts on per-location feature vectors
        Z = Z_align.permute(0, 2, 3, 1).reshape(B * H * W, C)
        U = u_theta(Z, t)            # time-averaged velocity over the interval t
        Z_mapped = Z - U             # single-step MeanFlow displacement
        X_out = Z_mapped.view(B, H, W, C).permute(0, 3, 1, 2)
        return X_out
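A toy usage sketch for the module above (channel sizes and the `u_theta` stand-in are illustrative; the 2-layer GeLU MLP omits the time embedding for brevity):

```python
import torch
import torch.nn as nn

conv_align = nn.Conv2d(64, 128, kernel_size=1)      # 1x1 alignment convolution
batch_norm = nn.BatchNorm2d(128)
mlp = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))
u_theta = lambda z, t: mlp(z)                       # toy velocity network, ignores t

x_in = torch.randn(4, 64, 32, 32)
x_out = MeanFlowModule(x_in, conv_align, batch_norm, u_theta)
print(x_out.shape)                                  # torch.Size([4, 128, 32, 32])
```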
Curriculum scheduling for $\lambda$ enables stable gradient propagation during training (You et al., 24 Aug 2025).
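A minimal linear-warmup schedule consistent with this description (the horizon and maximum value are illustrative defaults, not reported settings):

```python
def lambda_schedule(step, warmup_steps=10_000, lam_max=1.0):
    """Linearly increase the gradient-modulation coefficient lambda from 0
    to lam_max over warmup_steps, then hold it fixed."""
    return min(lam_max, lam_max * step / warmup_steps)
```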
4. Variant-Level Design and Modularity
Modular MeanFlow generalizes prior architectures by allowing flexible instantiation of key components: noise generator, fixed module, and residual learner.
- Residual forms determine how the learned residual is combined with the noise input and the fixed module. Empirical studies confirm that the MeanFlow residual parameterization yields stable and expressive decoders, whereas naive residual choices cause mode collapse and elevated out-of-bounds action rates (Wang et al., 17 Nov 2025).
- This modularity enables drop-in replacement or compression of multi-block stages (e.g., ResNet, DenseNet, Transformer encoder groups) by single-step MeanFlow modules, as sketched after this list (Sun et al., 16 Nov 2025).
- Compression–expansion strategies retain discriminative capacity (by selectively incubating critical early blocks) while achieving parameter efficiency.
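A hedged sketch of such a drop-in replacement, reusing the MeanFlowModule from Section 3 and assuming a torchvision-style ResNet-50 with stage attributes layer1–layer4 (the choice of stage and the wrapper class are illustrative, not the exact MFI-ResNet construction):

```python
import torch.nn as nn
from torchvision.models import resnet50

class MeanFlowStage(nn.Module):
    """Wraps MeanFlowModule so it can stand in for a multi-block residual stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # stride-2 alignment mirrors the replaced stage's spatial downsampling
        self.conv_align = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.mlp = nn.Sequential(nn.Linear(out_ch, 2 * out_ch), nn.GELU(),
                                 nn.Linear(2 * out_ch, out_ch))

    def forward(self, x):
        return MeanFlowModule(x, self.conv_align, self.bn, lambda z, t: self.mlp(z))

model = resnet50(weights=None)
# Replace the third multi-block stage (512 -> 1024 channels in ResNet-50) with a
# single-step MeanFlow stage; earlier, critical blocks are kept ("incubated").
model.layer3 = MeanFlowStage(in_ch=512, out_ch=1024)
```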
5. Architectural Configurations and Hyperparameters
Key settings for Modular MeanFlow in various domains include:
| Feature | RL (MeanFlow Policy) | Vision (MFI-ResNet) | Generative Modeling (MMF) |
|---|---|---|---|
| Policy / Velocity Network | DiT-style Transformer (depth=3, hidden=256, heads=2) | 2-layer MLP with GeLU, time embedding | UNet with residual blocks |
| Critic Network | 4-layer MLP, width 512 | N/A | N/A |
| Learning Rates | Cosine warmup | N/A | Adam |
| α (behavior cloning weight) | Adaptive | N/A | N/A |
| JVP Usage | Forward-mode autodiff | N/A | λ>0 only |
| Curriculum Schedule | N/A | N/A | Linear λ warmup |
| Data Batch Sizes | N/A | N/A | 128 |
| Noise Dimension | Action dimension | N/A | Data dimension |
- For meta-architectures, 1×1 convolution, batch normalization, and ReLU are used for dimensional alignment before velocity prediction (Sun et al., 16 Nov 2025).
- Warmup horizons and curriculum schedules are employed for stability in generative modeling (You et al., 24 Aug 2025).
6. Empirical Findings and Analysis
Modular MeanFlow achieves strong empirical performance and stability across domains.
- In RL, Modular MeanFlow matches or outperforms ten baselines (Gaussian, diffusion, flow) on 68/73 tasks, e.g., OGBench antmaze-large-singletask (81% vs FQL’s 79%), humanoidmaze-large (20% vs FQL’s 4%), puzzle-3x3 (66% vs FQL’s 30%). Offline→online fine-tuning achieves 82→100% on humanoidmaze, 62→100% on antsoccer (Wang et al., 17 Nov 2025).
- In ResNet compression, MFI-ResNet-50 reduces parameters by 46.3%, with marginal increase in accuracy (CIFAR-10: 95.56% vs 95.34%; CIFAR-100: 75.93% vs 75.80%) (Sun et al., 16 Nov 2025).
- For image synthesis, curriculum-trained MMF attains the lowest FID (3.41) and 1-step MSE, with robust convergence on CIFAR-10 and strong generalization in low-data and out-of-distribution regimes. Inference cost remains low (∼0.02–0.03 s/image) (You et al., 24 Aug 2025).
Ablation studies demonstrate:
- Nonlinearities in the residual module and time step discretization strategies are crucial for stability.
- Naive residual forms or improper behavior-cloning loss weighting cause collapse or unstable Q-targets.
- Curriculum-style gradient modulation improves stability and sample quality over fixed $\lambda$ (You et al., 24 Aug 2025).
7. Significance, Limitations, and Future Directions
Modular MeanFlow provides a principled, scalable approach to one-step generative modeling and policy learning. Its main advantages are:
- Unified framework generalizing consistency models and flow-matching approaches.
- Modular design allows flexible adaptation to various architectures and tasks.
- Single-evaluation generative sampling reduces computation time.
- Gradient modulation and curriculum warmup balance learning stability and expressiveness.
Limitations include the overhead of Jacobian-vector products (JVPs) in full-gradient regimes and the heuristic nature of curriculum schedules. Scaling to higher data resolutions, developing adaptive scheduling, and further theoretical analysis of generalization error remain open (You et al., 24 Aug 2025). Extension to hybrid one-plus-few-step schemes may offer further speed-fidelity trade-offs.
In summary, Modular MeanFlow delivers state-of-the-art results in generative modeling, offline RL, and neural architecture optimization via unified residual velocity fields and modular loss/objective design (You et al., 24 Aug 2025, Wang et al., 17 Nov 2025, Sun et al., 16 Nov 2025).