
MeanAudio: Fast & Faithful TTA

Updated 12 August 2025
  • MeanAudio is a text-to-audio framework that uses a Flux-style latent transformer and mean flow regression to directly map noisy latents to target audio in a single step.
  • It integrates classifier-free guidance with a dual conditioning strategy and employs an instantaneous-to-mean curriculum with flow field mix-up to enhance stability and synthesis quality.
  • The approach achieves state-of-the-art performance, with a real-time factor of 0.013 and improved quality metrics compared to multi-step diffusion-based systems.

MeanAudio refers to a class of models and methodologies designed for fast and faithful text-to-audio (TTA) generation by leveraging so-called "Mean Flows" within a Flux-style transformer architecture operating in the audio latent space. The central innovation is to regress the average velocity field during training, enabling the model to directly map from the starting latent to the endpoint, thus supporting single-step generation and dramatically decreasing inference latency while preserving synthesis quality.

1. Model Architecture

MeanAudio utilizes a Flux-style latent transformer, operating on latent spaces produced by a variational autoencoder (VAE) trained on mel spectrograms. The architecture consists of two major block types:

  • Multi-modal transformer blocks (MMDiT; N₁ layers): Jointly process audio and text inputs, incorporating token embeddings from an instruction-tuned LLM (FLAN-T5) for detailed linguistic conditioning.
  • Single-modal transformer blocks (DiT; N₂ layers): Specialize for audio-only processing.

Within the audio branch, ConvMLPs (using 1D convolutions in place of conventional MLP layers) facilitate local temporal pattern modeling. Rotary positional embeddings (RoPE) provide fine-grained relative position encoding, and RMSNorm with learnable scales is used for stable gradient flow. Text conditioning follows a dual strategy: fine-grained language features from FLAN-T5, and global semantic grounding from CLAP (Contrastive Language-Audio Pretraining) embeddings, the latter injected into the network alongside projected timestep embeddings via adaptive layer normalization (AdaLN).
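
As an illustration of the ConvMLP idea, here is a minimal PyTorch sketch; the class name, kernel size, and hidden-width multiplier are assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvMLP(nn.Module):
    """Feed-forward block using 1D convolutions in place of point-wise
    linear layers, so each position mixes a small local temporal window.
    (Illustrative sketch; sizes and activation are assumptions.)"""
    def __init__(self, dim: int, hidden_mult: int = 4, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2  # "same" padding keeps sequence length fixed
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim * hidden_mult, kernel_size, padding=padding),
            nn.GELU(),
            nn.Conv1d(dim * hidden_mult, dim, kernel_size, padding=padding),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time)
        return self.net(x.transpose(1, 2)).transpose(1, 2)
```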

Critically, the MeanAudio transformer regresses an average velocity field in the latent space—mapping a noisy latent (sampled from a standard prior) directly to the target latent associated with the desired audio, in a single function evaluation.

2. MeanFlow Training Objective

MeanAudio's principal methodological distinction is the MeanFlow objective:

  • Instantaneous velocity field: $v_t = \epsilon - x$, where $x$ is the clean latent and $\epsilon$ is Gaussian noise.
  • Average velocity field over the interval $[r, t]$:

$$u(x_t, r, t) \triangleq \frac{1}{t - r} \int_r^t v(x_\tau, \tau)\, d\tau$$

  • Mean Flow Identity: The model enforces

$$u(x_t, r, t) = v_t - (t - r)\, \frac{d}{dt} u(x_t, r, t)$$

  • MeanFlow loss function:

$$\mathcal{L}_{\mathrm{MF}} = \mathbb{E}_{t, r, x, \epsilon} \left\| f_\theta(x_t, r, t) - \mathrm{sg}(u_{\mathrm{target}}) \right\|^2$$

where

$$u_{\mathrm{target}} = v_t - (t - r)\, \frac{d}{dt} f_\theta(x_t, r, t)$$

The derivative $\frac{d}{dt} f_\theta(x_t, r, t)$ is computed via a Jacobian-vector product (JVP), decomposed as $v_t\, \partial_x f_\theta + \partial_t f_\theta$. The model is thus incentivized to learn a representation that can "skip" along the flow trajectory, enabling highly efficient deterministic one-shot synthesis.
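
A minimal PyTorch sketch of this objective using `torch.func.jvp` follows; the linear noising path, the `f_theta(x_t, r, t)` signature, and the omission of text conditioning are simplifying assumptions.

```python
import torch
from torch.func import jvp

def meanflow_loss(f_theta, x, eps, r, t):
    """MeanFlow regression loss (sketch). x: clean latents; eps: Gaussian
    noise; r, t: time tensors broadcastable against x, with r <= t."""
    x_t = (1 - t) * x + t * eps   # linear interpolation between data and noise
    v_t = eps - x                 # instantaneous velocity along that path
    # Total derivative d/dt f_theta(x_t, r, t) = v_t . d_x f + d_t f,
    # obtained with a single Jacobian-vector product along (v_t, 0, 1).
    u, du_dt = jvp(f_theta, (x_t, r, t),
                   (v_t, torch.zeros_like(r), torch.ones_like(t)))
    u_target = v_t - (t - r) * du_dt
    return ((u - u_target.detach()) ** 2).mean()  # detach realizes sg(.)
```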

3. Classifier-Free Guidance Integration

Classifier-Free Guidance (CFG) is directly built into the training target for MeanAudio, circumventing the need for extra model evaluations during inference:

  • Guided instantaneous velocity field:

$$v_t^{\mathrm{cfg}} = \omega\, v_t + \kappa\, f_\theta(x_t, t, t \mid \emptyset) + (1 - \omega - \kappa)\, f_\theta(x_t, t, t \mid C)$$

where $f_\theta(\cdot \mid C)$ denotes the model output given text condition $C$, and $f_\theta(\cdot \mid \emptyset)$ is the unconditional output.

  • Integrated mean velocity target:

$$u_{\mathrm{target}}^{\mathrm{cfg}} = v_t^{\mathrm{cfg}} - (t - r)\, \frac{d}{dt} f_\theta(x_t, r, t)$$

By training with CFG-based targets, prompt adherence at inference incurs no added computational cost, in contrast with most prior TTA models, where CFG requires an extra unconditional forward pass at every sampling step.
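
A sketch of how the guided velocity could be assembled during training follows; the `cond=None` convention for the unconditional branch and the guidance weights are assumptions, not the paper's values.

```python
def cfg_velocity(f_theta, x_t, t, v_t, cond, omega=3.0, kappa=0.3):
    """Build the guided instantaneous velocity v_t^cfg (sketch). Both model
    calls use r = t, where the average velocity reduces to the instantaneous
    one; omega and kappa are illustrative guidance weights."""
    v_uncond = f_theta(x_t, t, t, cond=None)  # unconditional branch
    v_cond = f_theta(x_t, t, t, cond=cond)    # text-conditioned branch
    return omega * v_t + kappa * v_uncond + (1 - omega - kappa) * v_cond
```

This guided velocity then replaces $v_t$ when forming the mean velocity target, per the equation above, so inference still needs only a single conditional forward pass.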

4. Training Curricula: Instantaneous-to-Mean and Flow Field Mix-Up

Training is divided into phases for increased stability and fidelity:

  • Phase 1: Standard flow matching, learning instantaneous velocity dynamics on the full dataset.
  • Phase 2 (fine-tuning): Transitions to average velocity regression (MeanFlow) on high-quality samples, reinforcing long-range dynamics.
  • Flow Field Mix-Up: Randomly setting $r = t$ on a subset of training samples, so the model alternates between learning instantaneous (small-displacement) and mean (long-range) flows; a sampling sketch follows this list. This technique mitigates instability when learning directly from mean flows, increases training robustness, and preserves the model's high-quality multi-step synthesis ability.
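
A minimal sketch of how $(r, t)$ pairs could be drawn under this scheme; the mix-up probability and the uniform time sampling are illustrative assumptions, not values from the paper.

```python
import torch

def sample_rt(batch_size: int, mixup_prob: float = 0.5, device="cpu"):
    """Draw (r, t) pairs for MeanFlow training with flow field mix-up:
    with probability mixup_prob, set r = t so the regression target
    collapses to the instantaneous velocity (standard flow matching)."""
    t = torch.rand(batch_size, device=device)
    r = torch.rand(batch_size, device=device) * t  # uniform in [0, t]
    collapse = torch.rand(batch_size, device=device) < mixup_prob
    r = torch.where(collapse, t, r)
    return r, t
```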

5. Performance Evaluation

MeanAudio achieves state-of-the-art metrics for single-step text-to-audio generation:

  • Real Time Factor (RTF): 0.013 on an NVIDIA RTX 3090—a 100× speedup compared to leading diffusion-based systems such as GenAU with 200 steps.
  • Quality metrics: Lower Fréchet Distance (FD), Fréchet Audio Distance (FAD), and Kullback-Leibler divergence (KL); higher Inception Score (IS) and CLAP similarity scores than diffusion-based baselines.
  • Multi-step generation: The underlying instantaneous dynamics, preserved through curriculum and mix-up, support smooth and coherent latent trajectories over multiple synthesis steps, thus improving quality further even when latency constraints are relaxed.

Metric         MeanAudio (1-NFE)   SOTA Diffusion (200-NFE)
RTF            0.013               ~1.3
FD, FAD, KL    Lower               Higher
IS, CLAP       Higher              Lower

A plausible implication is that MeanAudio is suitable both for low-latency, high-throughput applications and for higher fidelity synthesis when multi-step refinement is deployed.

6. Implementation and Practical Considerations

The architecture is instantiated with:

  • MMDiT blocks for joint audio-text processing,
  • DiT blocks for audio latent modeling,
  • ConvMLP and RoPE for temporal and relative position encoding,
  • RMSNorm for training stability,
  • Dual conditioning from FLAN-T5 and CLAP,
  • CFG baked into training loss,
  • Instantaneous-to-mean curriculum and flow field mix-up.

The paper provides pseudocode for both the training algorithm and the sampling loop. Single-step inference is expressed as

$$x_0 = x_1 - f_\theta(x_1, 0, 1),$$

where $x_1$ is a noisy prior latent and $f_\theta$ is the trained MeanAudio transformer.
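
A sketch of a deterministic sampler that instantiates this rule and generalizes it to multiple steps; the uniform time grid and the `f_theta(x, r, t)` calling convention are assumptions.

```python
import torch

@torch.no_grad()
def sample(f_theta, shape, num_steps: int = 1, device="cpu"):
    """Deterministic MeanFlow sampling (sketch). With num_steps=1 this is
    exactly x_0 = x_1 - f_theta(x_1, 0, 1); larger num_steps walks the
    same trajectory in shorter mean-velocity segments."""
    x = torch.randn(shape, device=device)  # x_1 ~ N(0, I), the noisy prior
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for k in range(num_steps):
        t, r = ts[k], ts[k + 1]            # integrate from t down to r
        x = x - (t - r) * f_theta(x, r, t)
    return x  # generated audio latent, to be decoded by the VAE
```

Because the curriculum preserves the instantaneous dynamics, the step count acts as a quality knob rather than a requirement.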

Typical resource requirements for training align with contemporary large-scale TTA models; the single-step inference and fast runtime properties dramatically reduce deployment compute.

7. Impact and Future Research Directions

MeanAudio demonstrates that average flow regression with a tailored transformer backbone and integrated CFG yields significant improvements in both synthesis efficiency and quality. This suggests that mean flow objectives may become central in future generative audio architectures—especially for latency-critical applications such as real-time audio synthesis and text-conditioned sound design.

Possible directions highlighted include:

  • Extending the framework to other modalities (e.g., text-to-music, video-to-audio) where fast flow-based generative schemes and prompt conditioning are instrumental.
  • Investigations into more flexible curriculum learning strategies and advanced mix-up mechanisms for general-purpose generative modeling.
  • Further exploration of unified architectures for single-step and multi-step synthesis regimes.

MeanAudio delivers a marked advance over prior fast and controllable TTA models through its unique integration of mean flows, curriculum-based stabilization, and classifier-free guidance—underpinning both its empirical efficiency and high-quality synthesis results.
