High-Fidelity Neural Audio Compression
- High-fidelity neural audio compression is a set of deep neural network methods that encode audio signals into compact, discrete tokens while preserving perceptual quality at extremely low bitrates.
- These techniques integrate encoding, quantization, and decoding stages over waveform and spectrogram representations, along with dual-branch amplitude-phase models, for accurate reconstruction.
- Advanced strategies like Residual Vector Quantization, adaptive expert routing, and compound loss functions enable real-time operation, generative modeling, and low-latency streaming.
High-fidelity neural audio compression refers to a class of methods that use deep neural networks to compress audio signals (speech, music, and general sounds) into compact, often discrete, representations that preserve perceptual and signal fidelity even at very low bitrates. These techniques have surpassed traditional codecs in bandwidth efficiency and quality, and are foundational to generative audio modeling, low-latency streaming, and edge computing.
1. Foundational Architectures and Representation Strategies
Neural audio codecs consist of three main stages: encoding, quantization, and decoding/generation. The prevailing architecture involves an encoder network that transforms the raw waveform or a spectral representation (e.g., Mel-spectrogram or STFT coefficients) into a sequence or grid of continuous latent vectors. These latents are then quantized, most commonly using vector quantization (VQ) or residual vector quantization (RVQ), mapping continuous features to discrete indices. A decoder, often a transposed-convolutional neural network or neural vocoder, reconstructs the audio waveform or spectral features from these tokens (Défossez et al., 2022, Kumar et al., 2023, Feng et al., 21 Mar 2025, Zhang et al., 6 Jan 2026).
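The downsampling arithmetic behind this pipeline can be made concrete with a short shape walk; the stride pattern below is the one used by EnCodec-style convolutional encoders, and the rest is simple arithmetic.

```python
import numpy as np

# Shape walk through a typical waveform-domain codec
# (stride pattern as in EnCodec-style convolutional encoders).
sample_rate = 24_000               # input sampling rate in Hz
strides = (2, 4, 5, 8)             # strides of four downsampling conv blocks
hop = int(np.prod(strides))        # total downsampling factor: 320 samples/frame
frame_rate = sample_rate // hop    # 75 latent frames per second

one_second = np.zeros(sample_rate)         # 1 s of dummy audio
n_latents = len(one_second) // hop         # number of latent vectors to quantize
```

Each of those 75 latent vectors per second is then mapped to discrete indices by the quantizer, which is what determines the transmitted bitrate.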
Two primary representation strategies dominate:
- Waveform-domain codecs: Directly encode time-domain samples. Typical of early works such as SoundStream (Zeghidour et al., 2021), EnCodec (Défossez et al., 2022), and Improved RVQGAN (Kumar et al., 2023), these systems leverage convolutional encoders/decoders and deep RVQ stacks for flexible, universal modeling.
- Spectrogram-domain codecs: Encode log-magnitude or Mel-spectrograms, sometimes including phase features. The compressed latent is subsequently inverted to the time domain using powerful neural vocoders, often BigVGAN or its variants, exploiting the compactness and perceptually grounded properties of these spectral domains (Feng et al., 21 Mar 2025, Li et al., 2 Oct 2025, Zhang et al., 6 Jan 2026).
Recent methods hybridize these approaches, introducing dual-branch encoders for amplitude and phase (APCodec+ (Du et al., 2024)), transformer-based tokenizer blocks (MelCap (Li et al., 2 Oct 2025)), spectral residual processing (MUFFIN (Ng et al., 12 May 2025)), or bandwise VQ and psychoacoustic allocation.
2. Advances in Quantization: From RVQ to Expert Routing
Residual Vector Quantization (RVQ) has become a de facto standard. RVQ stacks multiple quantizer layers, where each codebook quantizes the residual error left by the previous, multiplying code capacity at linear bitrate cost (Défossez et al., 2022, Kumar et al., 2023, Siuzdak et al., 2024). This framework allows for variable bitrates via quantizer dropout or by adjusting the codebook stack at inference (Zeghidour et al., 2021).
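The residual-stacking idea can be sketched in a few lines of NumPy. The codebooks here are random rather than learned, so this only illustrates the encode/decode mechanics of RVQ, not the fidelity of a trained quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # nearest codeword for every frame (Euclidean distance)
        d = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = d.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]   # pass the residual to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords of all stages."""
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))

# toy setup: 8 latent frames of dim 4, three stages of 16 codewords each
latents   = rng.normal(size=(8, 4))
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]

codes = rvq_encode(latents, codebooks)
recon = rvq_decode(codes, codebooks)
err_full = np.linalg.norm(latents - recon)

# decoding from a prefix of the stack is how variable bitrate works at
# inference; quantizer dropout during training makes models robust to this
recon_1 = rvq_decode(codes[:1], codebooks[:1])
err_1 = np.linalg.norm(latents - recon_1)
```

Because each stage adds one index per frame, bitrate grows linearly in the number of stages while the effective codeword space grows multiplicatively.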
Sparse/Expert Quantization: Recent breakthroughs depart from rigid fixed-stage RVQ, proposing adaptive, input-dependent quantizer routing. SwitchCodec introduces Residual Experts Vector Quantization (REVQ), which combines a shared base codebook with a large pool of expert codebooks, routing only a small, dynamically selected subset per windowed latent. This approach decouples total codebook capacity from bandwidth, yielding both high fidelity at low bitrates and the ability to scale to arbitrarily large embedding spaces without commensurate bitrate inflation (Wang et al., 30 May 2025, Wang et al., 28 Jan 2026). Adaptive gating and mask computation ensure efficient expert utilization (DRPS strategy), and the variable-bitrate mechanism supports real-time adaptation, all without retraining.
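A minimal sketch of input-dependent codebook routing follows, with a made-up gating rule (distance from each expert's best codeword to the window's mean latent); the actual REVQ gating and DRPS mask computation are more involved.

```python
import numpy as np

rng = np.random.default_rng(1)

def route_experts(window, experts, k=2):
    """Toy input-dependent routing: score each expert codebook by how close
    its best codeword lies to the window's mean latent, keep the top-k.
    (Illustrative gating rule only; REVQ's gating differs in detail.)"""
    query = window.mean(axis=0)
    scores = [-np.linalg.norm(cb - query, axis=1).min() for cb in experts]
    return np.argsort(scores)[-k:]          # indices of the selected experts

experts = [rng.normal(size=(32, 8)) for _ in range(16)]   # large expert pool
window  = rng.normal(size=(10, 8))                        # windowed latents
chosen  = route_experts(window, experts, k=2)
# only the k chosen experts (plus a small routing mask) affect the bitrate,
# while total codebook capacity spans all 16 x 32 codewords
```

The key property this illustrates is the decoupling the text describes: the expert pool can grow arbitrarily large while the transmitted payload stays proportional to k plus the mask.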
Single-Codebook Tokenization: Motivated by the overhead of multi-stream quantization for downstream modeling, several architectures aim for high-fidelity compression via a single codebook. UniSRCodec achieves this using a 2D encoder/quantizer on Mel-spectrograms with sub-band reconstruction, enabling cross-domain high fidelity at a token rate as low as 40 tokens/sec (≈0.52 kbps), outperforming previous single-codebook methods by a wide margin (Zhang et al., 6 Jan 2026). MelCap follows a similar two-stage design (spectrogram + vocoder), further showing that single-codebook codecs can match or surpass multi-codebook approaches at comparable rates (Li et al., 2 Oct 2025).
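The token-rate/bitrate relationship behind these figures is simple arithmetic: bitrate = token rate × bits per token × number of streams. The 8192-entry codebook below is our inference from the quoted numbers (40 tokens/sec at ≈0.52 kbps implies 13 bits per token), not a figure stated here.

```python
import math

def bitrate_kbps(token_rate_hz, codebook_size, n_codebooks=1):
    """Bitrate of a tokenizer: tokens/s x bits/token x number of streams."""
    return token_rate_hz * math.log2(codebook_size) * n_codebooks / 1000

# 40 tokens/s with a single 13-bit (8192-entry) codebook -> 0.52 kbps
# (codebook size inferred from the quoted figures, not stated in the text)
single = bitrate_kbps(40, 8192)

# for comparison, a hypothetical RVQ codec at 75 frames/s with 8 codebooks
# of 1024 entries transmits 75 * 10 * 8 = 6000 bits/s = 6 kbps
multi = bitrate_kbps(75, 1024, n_codebooks=8)
```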
3. Spectral and Perceptual Signal Modeling
Spectral Feature Compression: Operating in spectral domains yields quadratic compression benefits, since downsampling applies jointly along time and frequency, and mel/bandwise quantization aligns with perceptual sensitivity. UniSRCodec reduces the latent grid from 128×128 to 8×8, facilitating ultra-low token rates without fidelity collapse (Zhang et al., 6 Jan 2026). STFTCodec further incorporates phase-derivative features to permit phase-aware time-frequency modeling, and achieves superior performance to waveform-based systems at similar rates (Feng et al., 21 Mar 2025).
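The "quadratic" gain is just the product of the per-axis reductions, as a quick check of the quoted grid sizes shows:

```python
# Joint time-frequency downsampling compounds multiplicatively.
t_in, f_in = 128, 128     # input Mel-spectrogram patch: frames x mel bins
t_out, f_out = 8, 8       # latent grid after 2D encoding (UniSRCodec figures)

per_axis = t_in // t_out                      # 16x along each axis
total = (t_in * f_in) // (t_out * f_out)      # 256x fewer positions to tokenize
```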
Psychoacoustic Codebook Allocation: MUFFIN exemplifies perceptually guided resource allocation. Multi-band VQ is used, with the bit allocation in each frequency band determined by salience computed from psychoacoustic masking thresholds and power spectral density. This disentangles content-carrying tokens (low/mid bands) from identity cues (high bands), significantly improving both compression efficiency and downstream generative compatibility (Ng et al., 12 May 2025).
Phase Modeling: Recent codecs have increased emphasis on accurate phase reconstruction, such as the parallel amplitude/phase subnetworks of APCodec+ and the use of advanced vocoders (BigVGAN-v2) in UniSRCodec and MBCodec. These approaches guarantee that reconstructed waveforms maintain temporal microstructure and naturalness, especially at low bitrates (Du et al., 2024, Zhang et al., 6 Jan 2026).
4. Loss Design and Adversarial Training
Compound Objective Functions: High-fidelity codecs typically combine several loss terms:
- Multi-scale waveform and spectrogram (Mel/STFT) L1/L2 losses to encourage accurate amplitude structure at several resolutions.
- Adversarial (GAN) losses, often with multi-period and multi-scale STFT discriminators, to capture fine-grained phase and spectral detail and suppress artifacts (Défossez et al., 2022, Kumar et al., 2023, Wang et al., 30 May 2025).
- Feature-matching losses over discriminator activations for stable generator updates.
- Commitment and codebook losses to balance training between encoder, quantizer, and decoder.
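The first of these terms, the multi-scale spectral loss, can be sketched in plain NumPy (Hann-windowed magnitude STFTs compared with L1 at several resolutions); real codecs typically add log-magnitude and Mel-warped variants on top of this.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via Hann-windowed frames and a real FFT."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multiscale_spectral_loss(ref, est, ffts=(256, 512, 1024)):
    """L1 distance between magnitude spectrograms at several resolutions,
    a common reconstruction term in neural codec training."""
    loss = 0.0
    for n_fft in ffts:
        hop = n_fft // 4
        loss += np.abs(stft_mag(ref, n_fft, hop)
                       - stft_mag(est, n_fft, hop)).mean()
    return loss / len(ffts)

rng = np.random.default_rng(1)
clean = rng.normal(size=4096)
noisy = clean + 0.1 * rng.normal(size=4096)
```

Evaluating at several FFT sizes is what gives the loss both good temporal resolution (small windows) and good frequency resolution (large windows).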
Training Paradigms: APCodec+ and AudioDec introduce staged training, freezing the encoder/quantizer after joint training and then fine-tuning the decoder/discriminator. This decouples the difficult adversarial phase from representation learning, improving convergence and final fidelity at low bitrates (Du et al., 2024, Wu et al., 2023).
Progressive/Flexible Rate Control: Quantizer dropout (SoundStream, Improved RVQGAN) or adaptive gating (SwitchCodec) allows trained models to provide graceful trade-offs between fidelity and bitrate at inference, without need for retraining (Zeghidour et al., 2021, Wang et al., 28 Jan 2026).
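Quantizer dropout amounts to training with a random prefix of the RVQ stack; a minimal sketch follows (the exact sampling scheme is an assumption, not taken from the cited papers).

```python
import numpy as np

rng = np.random.default_rng(2)

def active_quantizers(n_q, p_full=0.5):
    """Quantizer dropout: with probability p_full train on the full stack,
    otherwise on a uniformly sampled prefix, so the decoder learns to cope
    with any truncation chosen at inference time. (Sampling scheme assumed.)"""
    if rng.random() < p_full:
        return n_q
    return int(rng.integers(1, n_q + 1))

# per-example draws during training for an 8-stage RVQ
draws = [active_quantizers(8) for _ in range(1000)]
```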
5. Empirical Evaluation, Benchmarks, and Limitations
Codec efficacy is established on standard datasets such as AudioSet, LibriTTS, VCTK, FMA, and GTZAN. Evaluation metrics include:
- Perceptual (PESQ, ViSQOL, POLQA, MUSHRA)
- Intelligibility (eSTOI, STOI)
- Fidelity (Mel/STFT L1, SI-SDR, log-spectral distance)
- Semantic preservation for downstream tasks (sound/event classification, ASR, speaker similarity)
Key findings include:
- State-of-the-art perceptual quality at 0.3–3 kbps, with SwitchCodec achieving MUSHRA 91.7 at 2.67 kbps, UniSRCodec-B matching or exceeding prior single-codebook methods at 0.52 kbps, and GAC approaching MOS 4.18 at 0.275 kbps (Zhang et al., 6 Jan 2026, Ma et al., 31 Jan 2026).
- Superior cross-domain generality: Many recent codecs compress speech, music, and effects with a single model, crucial for generative language modeling.
- Idempotence: Code Drift (O'Reilly et al., 2024) investigates repeated re-encoding, finding that fine-tuning with codebook-space idempotence loss can yield stable token reuse and preserve quality for multiple passes.
- Real-time, streaming operation: HILCodec, AudioDec, Penguins, and SoundStream run faster than real time on CPUs and mobile devices, with sub-10 ms algorithmic latency at common window sizes (Ahn et al., 2024, Wu et al., 2023, Liu et al., 2023, Zeghidour et al., 2021).
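The idempotence property under study can be made concrete with a toy scalar quantizer, which (unlike typical neural codecs) is exactly idempotent: re-encoding its own output changes no tokens.

```python
import numpy as np

# Toy 3-bit scalar "codec": its round-trip is exactly idempotent, unlike
# typical neural codecs, whose tokens drift under repeated re-encoding.
encode = lambda x: np.clip(np.round(x * 4), -4, 3).astype(int)
decode = lambda c: c / 4.0

rng = np.random.default_rng(4)
audio = rng.uniform(-1, 1, size=1000)

codes = encode(audio)
drift = []
for _ in range(3):
    codes_next = encode(decode(codes))            # re-encode the decoded audio
    drift.append(int((codes_next != codes).sum()))  # changed token positions
    codes = codes_next
# drift == [0, 0, 0]: an idempotent codec keeps its tokens fixed across passes
```

Measuring the analogous per-pass token change for a neural codec is essentially what the Code Drift analysis does, and the idempotence fine-tuning loss pushes that count toward zero.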
Reported limitations include trade-offs between streaming/ultra-low-latency operation and bitrate overhead (e.g., mask transmission), instability in adversarial training, and domain specialization: many codecs are evaluated chiefly on speech or universal audio, and require further work for robust adaptation to music or highly nonstationary signals (Wang et al., 30 May 2025, Zhang et al., 6 Jan 2026, Liu et al., 2023).
6. Downstream and Generative Applications
High-fidelity neural codecs underpin modern generative modeling (TTS, audio language modeling). MUFFIN and MBCodec demonstrate that bandwise and disentangled VQ designs facilitate both compression and highly effective token representations for models such as VALL-E and masked-transformer vocoders, yielding superior WER, MOS, and speaker similarity in zero-shot TTS and speech generation (Ng et al., 12 May 2025, Zhang et al., 21 Sep 2025).
Generative Audio Compression (GAC) introduces a paradigm where semantic tokens, obtained from deeply learned audio-understanding encoders, are transmitted and a very large diffusion/flow-based decoder reconstructs audio, shifting the rate–distortion trade-off in favor of more compute and less bandwidth (MOS ≈ 4.18 at 0.275 kbps) (Ma et al., 31 Jan 2026). This approach generalizes beyond traditional signal fidelity toward task-oriented effectiveness, leveraging model priors to compensate for ultra-aggressive compression.
References
- UniSRCodec: (Zhang et al., 6 Jan 2026)
- SwitchCodec: (Wang et al., 30 May 2025, Wang et al., 28 Jan 2026)
- MelCap: (Li et al., 2 Oct 2025)
- EnCodec: (Défossez et al., 2022)
- Improved RVQGAN: (Kumar et al., 2023)
- MUFFIN: (Ng et al., 12 May 2025)
- MBCodec: (Zhang et al., 21 Sep 2025)
- GAC: (Ma et al., 31 Jan 2026)
- Code Drift: (O'Reilly et al., 2024)
- HILCodec: (Ahn et al., 2024)
- APCodec+: (Du et al., 2024)
- SNAC: (Siuzdak et al., 2024)
- AudioDec: (Wu et al., 2023)
- SoundStream: (Zeghidour et al., 2021)
- Penguins: (Liu et al., 2023)
- STFTCodec: (Feng et al., 21 Mar 2025)
- HH-Codec: (Xue et al., 25 Jul 2025)
- SQCodec: (Zhai et al., 7 Apr 2025)
- Siamese SIREN: (Lanzendörfer et al., 2023)