Conditional Mean Faithful Generation (cMFG)
- Conditional Mean Faithful Generation (cMFG) is a framework that refines conditional flow matching by integrating mean-flow regression and embedded classifier-free guidance into a unified generative model.
- The method employs a flux-style latent Transformer and a two-stage curriculum with flow mix-up to balance local detail and global structure for accurate conditional synthesis.
- Empirical evaluations on AudioCaps demonstrate that cMFG achieves up to a 100× speed improvement and significant metric gains, reducing FAD by 23% and enhancing real-time text-to-audio generation.
Conditional Mean Faithful Generation (cMFG) is a paradigm introduced to address the inefficiencies of traditional stochastic denoising strategies in conditional generative models. Originally proposed within the MeanAudio framework for text-to-audio generation, cMFG integrates mean-flow regression, conditional flow matching, and classifier-free guidance directly into the model objective and architecture. The result is a single-step or efficient multi-step generator that stays semantically faithful to the conditional input while running much faster at inference than diffusion-based or standard flow-matching approaches (Li et al., 8 Aug 2025).
1. Theoretical Foundations and Formal Objectives
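Conditional Mean Faithful Generation extends deterministic Flow Matching (FM) by generalizing the target from the instantaneous velocity field to the mean velocity over arbitrarily sized time intervals. Let $x_0$ denote latent representations, $\epsilon \sim \mathcal{N}(0, I)$ denote noise, and $c$ denote the condition. The linear path between data and noise is parametrized as

$$x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad t \in [0, 1],$$

with instantaneous velocity

$$v_t = \frac{dx_t}{dt} = \epsilon - x_0 .$$

Traditional Conditional Flow Matching trains a model $v_\theta$ to minimize

$$\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t, x_0, \epsilon}\,\left\| v_\theta(x_t, t \mid c) - v_t \right\|^2 .$$

Mean Flow-guided training instead regresses the average velocity field over intervals $[r, t]$:

$$u(x_t, r, t) = \frac{1}{t - r}\int_r^t v(x_\tau, \tau)\,d\tau ,$$

and the loss is

$$\mathcal{L}_{\mathrm{MF}} = \mathbb{E}_{r, t, x_0, \epsilon}\,\left\| u_\theta(x_t, r, t \mid c) - \mathrm{sg}(u_{\mathrm{tgt}}) \right\|^2 ,$$

where $\mathrm{sg}(\cdot)$ denotes stop-gradient, so gradients do not backpropagate through the target

$$u_{\mathrm{tgt}} = v_t - (t - r)\left( v_t\,\partial_x u_\theta + \partial_t u_\theta \right) .$$

When $r = t$, this reduces to the original FM loss, ensuring compatibility and stable interpolation between learning regimes.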
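The mean-flow regression target rests on the identity that the average velocity over an interval equals the instantaneous velocity minus (t − r) times the total time derivative of the average velocity. The sketch below checks this numerically on a toy one-dimensional field. The field `v(x, t) = x`, its closed-form mean velocity, and the finite-difference derivatives are all assumptions chosen for illustration; an actual implementation would typically obtain the derivative term from a Jacobian-vector product through the network.

```python
import math

def v(x, t):
    """Toy instantaneous velocity field (illustrative assumption): v(x, t) = x."""
    return x

def u(x, r, t):
    """Closed-form mean velocity of the toy field over [r, t].

    Trajectories satisfy x_tau = x_t * exp(tau - t), so the displacement
    x_t - x_r divided by (t - r) is the average velocity.
    """
    return x * (1.0 - math.exp(r - t)) / (t - r)

def mean_flow_target(x, r, t, h=1e-5):
    """u_tgt = v - (t - r) * (v * du/dx + du/dt), using central differences."""
    du_dx = (u(x + h, r, t) - u(x - h, r, t)) / (2 * h)
    du_dt = (u(x, r, t + h) - u(x, r, t - h)) / (2 * h)
    vel = v(x, t)
    return vel - (t - r) * (vel * du_dx + du_dt)

x, r, t = 0.7, 0.2, 0.9
gap = abs(mean_flow_target(x, r, t) - u(x, r, t))
print(f"identity gap: {gap:.2e}")
```

The printed gap should be at the level of finite-difference error, which shows that the target built from the instantaneous velocity reproduces the true mean velocity for this toy field.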
2. Architecture: Flux-Style Latent Transformer
MeanAudio's implementation of cMFG leverages a flux-style Transformer operating in the VAE latent space. The model $u_\theta$ consists of:
- A stack of multi-modal MMDiT blocks for joint audio/text attention.
- A stack of audio-only DiT blocks, each with hidden dimension 448; the model totals 120M parameters.
- Audio tokens (mel-spectrogram latents) are handled via ConvMLP (1D convolutions, kernel size 3) for local temporal feature extraction.
- Text features are provided by FLAN-T5 token embeddings and injected through cross-attention in MMDiT, supported by CLAP vectors projected into the time embedding space.
- Adaptive LayerNorm (AdaLN) and rotary positional embeddings (RoPE) are employed, with RMSNorm for stabilizing attention.
This architecture allows efficient, semantically aligned fusion of audio and text modalities, critical for conditional mean flow learning.
3. Integrated Classifier-Free Guidance (CFG)
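Conventional CFG doubles runtime during inference by requiring both conditional and unconditional model evaluations. In cMFG, CFG is incorporated into the training target, eliminating inference overhead. The guided instantaneous velocity is given by:

$$\tilde{v}_t = \omega\,v_t + \kappa\,u_\theta(x_t, t, t \mid c) + (1 - \omega - \kappa)\,u_\theta(x_t, t, t \mid \varnothing),$$

with $c$ as the conditioning text, $\varnothing$ indicating conditioning dropout (10% rate), and weights $\omega$, $\kappa$. The effective CFG scale is $\omega' = \omega / (1 - \kappa)$. The mean-flow regression target is updated to

$$u_{\mathrm{tgt}} = \tilde{v}_t - (t - r)\left( \tilde{v}_t\,\partial_x u_\theta + \partial_t u_\theta \right) .$$

The training loss regresses $u_\theta$ directly to this integrated CFG target, so only a single network evaluation is necessary for guided sampling.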
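As a rough illustration of how a guided velocity can be folded into the training target, the sketch below mixes a ground-truth velocity with conditional and unconditional model predictions using two weights. The function names, the mixing form ω·v + κ·u_cond + (1 − ω − κ)·u_uncond, and the numeric values are assumptions for illustration rather than the paper's settings.

```python
def guided_velocity(v_t, u_cond, u_uncond, omega, kappa):
    """Mix the true velocity with conditional/unconditional predictions.

    All three inputs are equal-length lists of floats. With kappa = 0 this
    is ordinary CFG: omega * v_t + (1 - omega) * u_uncond.
    """
    return [
        omega * v + kappa * uc + (1.0 - omega - kappa) * uu
        for v, uc, uu in zip(v_t, u_cond, u_uncond)
    ]

def effective_scale(omega, kappa):
    """Effective guidance scale omega / (1 - kappa); requires kappa < 1."""
    if kappa >= 1.0:
        raise ValueError("kappa must be smaller than 1")
    return omega / (1.0 - kappa)

# Illustrative values only.
v_t = [1.0, -0.5]
u_cond = [0.8, -0.4]
u_uncond = [0.2, 0.1]
print(guided_velocity(v_t, u_cond, u_uncond, omega=1.0, kappa=0.5))
print(effective_scale(omega=1.0, kappa=0.5))
```

Because this mixing happens only when building the training target, the sampler never needs a separate unconditional forward pass.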
4. Training Regime: Instantaneous-to-Mean Curriculum with Flow Mix-Up
Direct training with mean velocity targets is unstable, necessitating a staged curriculum:
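- Pre-training (Stage I): On large weakly labeled data (WavCaps ∪ AudioCaps ∪ Clotho), with $r = t$, minimizing the flow-matching loss $\mathcal{L}_{\mathrm{CFM}}$. This encourages accurate learning of local, instantaneous velocity.
- Fine-tuning (Stage II): On high-quality data (AudioCaps), $(r, t)$ are sampled from a log-normal prior. Under flow mix-up, each sample is randomly assigned, with a fixed mixing probability, either $r = t$ (instantaneous target) or $r \neq t$ (mean-velocity target). The objective becomes

$$\mathcal{L}_{\mathrm{MF}} = \mathbb{E}_{r, t, x_0, \epsilon}\,\left\| u_\theta(x_t, r, t \mid c) - \mathrm{sg}(u_{\mathrm{tgt}}) \right\|^2 .$$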
The flow "mix-up" ensures fine-grained denoising skill is preserved while enabling modeling of long-step, global mean displacements.
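The mix-up of instantaneous and mean-velocity targets can be sketched as a sampler over interval endpoints (r, t). Everything specific here is an assumption made for illustration: the name `p_instant` (the probability of collapsing the interval to r = t), its value, and the use of a sigmoid-of-Gaussian draw with `mu` and `sigma` to keep times inside (0, 1).

```python
import math
import random

def sample_time(rng, mu=-0.4, sigma=1.0):
    """Draw a time in (0, 1) as the sigmoid of a Gaussian (assumed prior)."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma)))

def sample_r_t(rng, p_instant):
    """Return (r, t) with r <= t; with probability p_instant set r = t."""
    a, b = sample_time(rng), sample_time(rng)
    t, r = max(a, b), min(a, b)
    if rng.random() < p_instant:
        r = t  # instantaneous case: the target is the plain FM velocity
    return r, t

rng = random.Random(0)
pairs = [sample_r_t(rng, p_instant=0.75) for _ in range(10_000)]
frac = sum(r == t for r, t in pairs) / len(pairs)
print(f"fraction with r == t: {frac:.3f}")
```

Samples with r = t keep the short-range denoising signal from pre-training, while the remaining samples expose the model to longer intervals.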
5. Sampling Procedure and Generation Dynamics
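cMFG's generative sampling exploits the learned mean velocity for direct trajectory integration from the noise prior. For general $N$-step integration over a time grid $1 = t_0 > t_1 > \dots > t_N = 0$, starting from $x_{t_0} = \epsilon$:

$$x_{t_{i+1}} = x_{t_i} - (t_i - t_{i+1})\,u_\theta(x_{t_i}, t_{i+1}, t_i \mid c), \qquad i = 0, \dots, N - 1 .$$

The one-step ($N = 1$) variant becomes:

$$x_0 = \epsilon - u_\theta(\epsilon, 0, 1 \mid c),$$

yielding a direct map from a Gaussian to the data manifold, guided by the conditional input.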
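A minimal sketch of few-step sampling with a mean-velocity function follows, assuming the convention that t = 1 is noise and t = 0 is data, with a uniform time grid. The function `toy_u` is a stand-in for the trained network: it is the exact mean velocity when the data distribution is a single point `X0`, a hypothetical setup chosen so the result can be checked by hand.

```python
X0 = 2.0  # hypothetical single data point for the toy example

def toy_u(x, r, t):
    """Exact mean velocity for point-mass data: paths are straight lines,
    so the average velocity equals the constant slope (x - X0) / t."""
    return (x - X0) / t

def sample(u_fn, eps, n_steps):
    """Integrate from t = 1 (noise) to t = 0 (data) in n_steps jumps."""
    x = eps
    for i in range(n_steps):
        t = 1.0 - i / n_steps
        r = 1.0 - (i + 1) / n_steps
        x = x - (t - r) * u_fn(x, r, t)  # jump from time t to time r
    return x

eps = -0.3
print(sample(toy_u, eps, n_steps=1))
print(sample(toy_u, eps, n_steps=4))
```

With an exact mean velocity, one step and four steps land on the same point. With a learned network the results differ, which is why additional steps can still improve fidelity.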
6. Performance: Faithfulness and Speed
In empirical evaluations, cMFG via MeanAudio achieves strong trade-offs between speed and fidelity. On the AudioCaps test set, single-step generation yields:
| Metric | Value | Change vs Prior |
|---|---|---|
| FAD | 1.77 | −23% |
| FD | 15.4 | −22% |
| KL | 1.31 | −8% |
| IS | 9.78 | +7% |
| CLAP | 0.292 | +9% |
The single-step real-time factor, measured on an RTX 3090, is about a 100× improvement over diffusion-based TTA systems (52.5). Multi-step generation further improves fidelity, suggesting robustness of the mean flow modeling (Li et al., 8 Aug 2025).
cMFG's text alignment is enhanced by embedding classical classifier-free guidance into the regression target, and two-stage curriculum learning ensures both global and local structure are maintained throughout training and generation.
7. Significance and Outlook
Conditional Mean Faithful Generation unifies conditional flow matching, mean-velocity regression, integrated classifier-free guidance, and a multi-phase curriculum into a coherent framework for rapid conditional generative modeling. By eliminating redundant inference passes and harnessing mean velocity fields, cMFG achieves both inference efficiency and semantic faithfulness. The demonstrated real-time acceleration and consistent performance gains in text-to-audio suggest applicability to real-time synthesis tasks and may inspire adaptations in other conditional sequence generation domains (Li et al., 8 Aug 2025).