
Generative Audio Compression (GAC)

Updated 7 February 2026
  • Generative Audio Compression (GAC) is a paradigm that employs deep generative models to reconstruct high-fidelity audio from highly compressed, semantically rich latent representations.
  • It leverages conditional normalizing flows, diffusion models, and GANs to achieve ultra-low-bitrate transmission and enables generative editing across diverse audio domains such as speech, music, and sound effects.
  • GAC shifts the traditional rate–distortion trade-off by exchanging higher model capacity for reduced bitrate, with performance validated through objective metrics and subjective listening tests.

Generative Audio Compression (GAC) refers to a paradigm in audio coding that leverages expressive deep generative models to synthesize high-fidelity audio from highly compressed—or semantically factorized—representations. In contrast to traditional codecs, which prioritize signal fidelity under algorithmic transformations and quantization, GAC explicitly utilizes statistical priors on audio, enabling reconstruction of perceptually meaningful signals from extremely compact codes. Recent GAC architectures integrate conditional normalizing flows, diffusion models, GANs, and semantic bottlenecks, supporting ultra-low-bitrate transmission, task-aligned encoding, and generative editing capabilities across general audio domains including speech, music, and sound effects.

1. Theoretical Foundations and Rate–Distortion Tradeoffs

The defining theoretical framework of GAC is the explicit exchange between bitrate and generative model capacity, shifting the classic Shannon rate–distortion paradigm by introducing powerful priors at the decoder. The Law of Information Capacity (IC-1) formalizes this as:

$$\eta N = D(H - L)$$

where $H$ is the source entropy, $L$ is the cross-entropy loss (or residual coding rate), $N$ is the number of model parameters, $D$ is the effective data magnitude, and $\eta$ is the information capacity per parameter. In practical GAC systems, as model capacity $N$ increases, the required channel bitrate $R$ for a fixed reconstruction quality can be reduced, encapsulating the "More Computation, Less Bandwidth" strategy (Ma et al., 31 Jan 2026). This computation–bandwidth trade-off is fundamental to enabling compression ratios unattainable by conventional codecs.
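As a back-of-envelope illustration of this relation, the sketch below treats IC-1 as given and shows the residual rate $L$ shrinking as model capacity $N$ grows; all constants are hypothetical, chosen only to make the trend visible.

```python
# Illustrative sketch of the IC-1 relation  eta * N = D * (H - L):
# for fixed source entropy H, larger model capacity N implies a smaller
# residual coding rate L, i.e. "More Computation, Less Bandwidth".
# All constants here are hypothetical, for illustration only.

def residual_rate(H: float, N: float, D: float, eta: float) -> float:
    """Residual rate L implied by IC-1: L = H - eta * N / D (floored at 0)."""
    return max(H - eta * N / D, 0.0)

H, D, eta = 8.0, 1e9, 2.0   # bits/symbol, data magnitude, capacity per parameter

for N in (1e6, 1e8, 1e9, 4e9):
    L = residual_rate(H, N, D, eta)
    print(f"N = {N:.0e} params -> residual rate L = {L:.2f} bits/symbol")
```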

The compression pipeline is typically decomposed into an encoder extracting semantically or perceptually meaningful latents or tokens (at rates of roughly 0.1–7 kbps, or frame rates as low as 7.8 Hz), and a decoder comprising a large-scale generative model that reconstructs the full-band audio, leveraging priors trained on vast audio corpora (Ma et al., 31 Jan 2026, Braun et al., 8 Oct 2025, Pia et al., 2024).
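The arithmetic behind such bitrates is straightforward: a token stream costs frame rate × number of codebooks × bits per code. A minimal sketch, with hypothetical configurations chosen to bracket the 0.1–7 kbps range cited above:

```python
import math

def token_bitrate_kbps(frame_rate_hz: float, n_codebooks: int,
                       codebook_size: int) -> float:
    """Bitrate of a quantized token stream: frames/s * codebooks * bits/code."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size) / 1000.0

# Hypothetical configurations bracketing the 0.1-7 kbps range:
print(token_bitrate_kbps(7.8, 1, 8192))    # ~0.10 kbps: 7.8 Hz frames, one 13-bit codebook
print(token_bitrate_kbps(75.0, 9, 1024))   # ~6.75 kbps: RVQ stack of 9 ten-bit codebooks
```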

2. Architectural Paradigms: Tokenizers, Quantization, and Latent Spaces

GAC systems universally adopt a modular pipeline with (1) a front-end audio encoder, (2) a quantization bottleneck, and (3) a generative decoder, as sketched below.
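A minimal PyTorch-style skeleton of this three-stage pipeline follows; the module shapes, the single VQ codebook, and the straight-through estimator are illustrative assumptions rather than any specific published architecture.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour VQ bottleneck with a straight-through gradient."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (batch, time, dim)
        d = torch.cdist(z.reshape(-1, z.size(-1)), self.codebook.weight)
        idx = d.argmin(dim=-1).view(z.shape[:-1])          # token ids to transmit
        zq = self.codebook(idx)
        zq = z + (zq - z).detach()                         # straight-through estimator
        return zq, idx

class GACPipeline(nn.Module):
    """(1) encoder -> (2) quantization bottleneck -> (3) generative decoder."""
    def __init__(self, dim=128, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=320, stride=320)   # crude downsampler
        self.vq = VectorQuantizer(codebook_size, dim)
        # Placeholder decoder; in practice a flow/diffusion/GAN generator.
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=320, stride=320)

    def forward(self, wav):                                # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)              # (batch, time, dim)
        zq, tokens = self.vq(z)                            # tokens are the "bitstream"
        return self.decoder(zq.transpose(1, 2)), tokens

x = torch.randn(2, 1, 16000)                               # one second at 16 kHz
recon, tokens = GACPipeline()(x)
print(recon.shape, tokens.shape)
```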

3. Generative Decoding: Flows, Diffusion, GANs, and Consistency Models

The core innovation of GAC lies in the generative decoder, capable of synthesizing plausible, perceptually aligned audio from underdetermined or semantically reduced codes.

  • Conditional Normalizing Flows / ODE Solvers: FlowMAC and GAC employ ODE-based flow models, parameterized by U-Net/Transformer architectures, integrating conditional vector fields to reverse a learned stochastic interpolation between prior noise and the target waveform. Flow-matching objectives train the flow field to match the optimal transport between prior and conditional distributions; see the flow-matching sketch after this list (Pia et al., 2024, Ma et al., 31 Jan 2026).
  • Diffusion Models and Consistency Models: Many recent codecs (AVCC, Music2Latent, Music2Latent2) use iterative denoising of noisy latent variables conditioned on encoder features, reduced to a single step in advances such as Music2Latent, attaining both high fidelity and fast inference (Xu et al., 17 Dec 2025, Pasini et al., 2024, Pasini et al., 29 Jan 2025). Consistency learning permits a direct map from noise to latent, making real-time deployment feasible.
  • Adversarial Generation (GAN Decoding): GAN-based decoders enforce distributional matching between generated and reference audio through adversarial training. Multi-resolution discriminators (MPD, MRD) and feature-matching objectives (L1, multi-scale STFT, perceptual embeddings) are ubiquitous, yielding high perceptual quality at low bitrates and restoring lost high-frequency content; see the STFT-loss sketch after this list. Examples include MFCC-GAN, Improved RVQGAN, Penguins, Gull, DAC, and MelCap (Hasanabadi, 2023, Kumar et al., 2023, Liu et al., 2023, Luo et al., 2024, Li et al., 2 Oct 2025).
  • Semantic and Multi-modal Generation: AVCC demonstrates cross-modal decoding via a joint audio–video diffusion process, where audio and video latents participate in multi-head cross-attention inside denoising blocks; this enables conditional synthesis and even cross-modal reconstruction at ultra-low rates (Xu et al., 17 Dec 2025).
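To make the flow-matching objective concrete, the sketch below implements one conditional flow-matching training step under the common linear (optimal-transport) interpolation; the small MLP vector field is a stand-in for the U-Net/Transformer architectures used in FlowMAC and GAC, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Minimal conditional flow-matching step (sketch). The linear path
# x_t = (1 - t) x0 + t x1 has the constant target velocity x1 - x0;
# the network v_theta(x_t, t, c) regresses it, conditioned on codec latents c.
class VectorField(nn.Module):
    def __init__(self, dim=80, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: target frames (batch, dim); cond: decoder conditioning latents."""
    x0 = torch.randn_like(x1)                 # sample from the noise prior
    t = torch.rand(x1.size(0), 1)             # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # OT interpolation path
    v_target = x1 - x0                        # velocity of that path
    return ((model(x_t, t, cond) - v_target) ** 2).mean()

model = VectorField()
loss = flow_matching_loss(model, torch.randn(4, 80), torch.randn(4, 128))
loss.backward()
```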
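Likewise, for the GAN-based decoders, here is a minimal sketch of the multi-scale STFT feature-matching term mentioned above; the chosen resolutions are typical values, not the settings of any particular cited paper.

```python
import torch

def multi_scale_stft_loss(x, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """L1 distance between log-magnitude STFTs at several resolutions.
    x, y: (batch, samples); resolutions are illustrative (n_fft, hop) pairs."""
    loss = 0.0
    for n_fft, hop in resolutions:
        window = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop, window=window, return_complex=True).abs()
        loss = loss + (torch.log(X + 1e-5) - torch.log(Y + 1e-5)).abs().mean()
    return loss / len(resolutions)

print(multi_scale_stft_loss(torch.randn(2, 16000), torch.randn(2, 16000)))
```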

4. Evaluation Methodologies, Metrics, and Subjective Testing

GAC models are evaluated across objective, perceptual, and application-level benchmarks.

Objective and Perceptual Metrics: Commonly reported measures, reflected in the table below, include ViSQOL, NISQA, and FAD, which quantify perceptual quality or distributional similarity without human raters.

Subjective Listening Tests: MUSHRA and MOS protocols remain the reference standard for comparing codecs at low bitrates; representative results follow.

Table: Example Bitrates and Perceptual Quality

| Model/Codec | Bitrate (kbps) | Perceptual Quality | Notable Result |
|---|---|---|---|
| FlowMAC | 3 | MUSHRA ≃ DAC/EnCodec at 6 kbps | Outperforms USAC 8 kbps at half the bitrate (Pia et al., 2024) |
| GAC (AI Flow framework) | 0.275 | MOS 4.1–4.2 | ≈3000× compression, outperforms semantic/wave codecs (Ma et al., 31 Jan 2026) |
| Vevo (semantic) | 0.65–0.9 | NISQA 4.2 | Matches or exceeds EnCodec @ 3 kbps on task/perceptual metrics (Collette et al., 18 Sep 2025) |
| Improved RVQGAN | 8 | ViSQOL 4.18 | Outperforms Opus/EnCodec at the same rate (Kumar et al., 2023) |
| AVCC (low-rate) | 0.36 | ViSQOL 2.9 | Joint audio–video, better than neural baselines (Xu et al., 17 Dec 2025) |

5. Applications, Limitations, and Future Directions

Applications:

  • Ultra-low-bitrate transmission: GAC systems, especially semantic and flow/diffusion-based, can operate at <1 kbps and still support intelligible communication, real-time dialog, and high perceptual quality (Ma et al., 31 Jan 2026, Collette et al., 18 Sep 2025).
  • Generative Editing and Model Chaining: Stable, token-based codecs with idempotent encoding enable iterative generation, cross-model handoff, and robust multi-step editing workflows (O'Reilly et al., 2024).
  • Cross-modal/Multi-modal Compression: AVCC demonstrates cross-modal generative pipelines supporting strong lip-sync and mutual reconstruction between audio and video modalities (Xu et al., 17 Dec 2025).
  • Downstream Usage: GAC latents have been shown to preserve semantic and instrumental structure for MIR, ASR, and classification, enabling joint generative-compression and discriminative tasks (Pasini et al., 2024, Pasini et al., 29 Jan 2025, Braun et al., 8 Oct 2025).

Limitations:

  • Inference Cost: Decoder complexity, especially for flow/ODE and diffusion models, can reach billions of parameters and substantial compute, posing challenges for on-device or real-time mobile deployment (Ma et al., 31 Jan 2026, Pia et al., 2024).
  • Vocoder Upper-bounds: Final perceptual quality is often bottlenecked by the vocoder (e.g., BigVGAN), rather than upstream coding (Pia et al., 2024).
  • Domain Generalization and OOD Robustness: Performance may degrade for out-of-distribution content. Robustness and generalization improvements are ongoing challenges (Pia et al., 2024, O'Reilly et al., 2024).
  • Latency and Algorithmic Delay: Some architectures (especially those relying on large-context or ODE/diffusion sampling) may preclude sub-100 ms streaming or synchronous telecom usage (Pia et al., 2024).

Open Problems and Future Work: Recurring open problems include efficient real-time and on-device deployment, domain adaptation and OOD robustness, multi-modal factorization, and the integration of joint task-driven objectives into end-to-end training, as summarized in the concluding discussion below.

6. Comparison of Representative Architectures

| Model | Encoder Domain | Quantization | Generative Decoder | Bitrate (kbps) | Core Innovation | Reference |
|---|---|---|---|---|---|---|
| FlowMAC | Mel-spectrogram | 8×RVQ, 256 codes | Conditional flow (ODE) | 1.5–6 | CFM-based decoder, real-time | (Pia et al., 2024) |
| GAC (AI Flow) | Tokenizer, VQ | Discrete tokens | Large flow-matching ODE | 0.175–0.275 | VIB semantic bottleneck | (Ma et al., 31 Jan 2026) |
| Improved RVQGAN | Waveform | 9×RVQ | GAN (multi-scale disc.) | 8 | Universal, 90× compression | (Kumar et al., 2023) |
| DAC/DACe | Conv. waveform | 16–32×RVQ | GAN | 8–32 | Balanced music/speech tokens | (Biswas et al., 23 Sep 2025) |
| Music2Latent | STFT | Continuous | Consistency, 1-step UNet | – | Single-step, 64× compression | (Pasini et al., 2024) |
| Music2Latent2 | STFT | Summary embeddings | Autoregressive consistency | – | Unordered chunk-level latents | (Pasini et al., 29 Jan 2025) |
| Gull | STFT subbands | Spherical RVQ | Elastic RNN/adversarial | 2–24 | User-tunable complexity | (Luo et al., 2024) |
| AVCC | Dual encoder | k-means | Joint AV diffusion | 0.36–1.4 | Cross-modal tokenization | (Xu et al., 17 Dec 2025) |
| MelCap | Mel-spectrogram | 2D VQ (1 code) | GAN vocoder | ~2–6 | One codebook for all audio | (Li et al., 2 Oct 2025) |
| MFCC-GAN | Handcrafted MFCC | Scalar Q | GAN | 8–36 | DSP front-end, GAN upsampler | (Hasanabadi, 2023) |
| SALAD-VAE | STFT | VAE (continuous) | Adversarial VAE | ~2–16 | Semantic contrastive/CLAP | (Braun et al., 8 Oct 2025) |
| Semantic Codec (Vevo) | Transformer | Token/embedding | Flow matching, TTS-style | 0.65–0.9 | Content/timbre factorization | (Collette et al., 18 Sep 2025) |

7. Idempotence, Stability, and Generative Editing

Idempotent neural codecs, which preserve token/code-streams under multiple rounds of encoding and decoding, are critical for stable generative workflows. Adding an idempotence loss on the quantized latents ensures long-range token stability, improves robustness against phase perturbations, and enables safe iterative generative editing, generative model chaining, and robust transcoding (O'Reilly et al., 2024). High codebook entropy and faithful phase encoding are necessary for this property in GAC pipelines.
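A schematic of this idempotence objective, under an assumed encode()/decode() codec interface (the ToyCodec below is purely a stand-in, not the cited method's architecture): re-encoding the decoded audio is penalized for drifting from the first-pass latents.

```python
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    """Stand-in codec; any encoder/decoder pair with matching shapes works."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv1d(1, dim, kernel_size=160, stride=160)
        self.dec = nn.ConvTranspose1d(dim, 1, kernel_size=160, stride=160)
    def encode(self, wav): return self.enc(wav)
    def decode(self, z): return self.dec(z)

def idempotence_loss(codec, wav):
    """Latent idempotence: latents of the re-encoded reconstruction should
    match the first-pass latents, keeping code-streams stable across rounds."""
    z1 = codec.encode(wav)                     # first-pass latents
    z2 = codec.encode(codec.decode(z1))        # latents after one decode/encode round
    return (z2 - z1.detach()).pow(2).mean()    # pull re-encoded latents back to originals

codec = ToyCodec()
loss = idempotence_loss(codec, torch.randn(2, 1, 16000))
loss.backward()
```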


GAC has thus advanced rapidly to the forefront of audio compression research, coupling deep generative models with domain-aware encoding. It attains bitrates previously exclusive to extreme parametric coders while delivering perceptually high-quality, semantically meaningful, and generatively flexible audio across domains. Current challenges include efficient real-time deployment, domain adaptation, multi-modal factorization, comprehensive OOD robustness, and the integration of joint task-driven objectives into end-to-end training.
