- The paper introduces SNAC, a multi-scale neural audio codec that extends RVQ by adapting temporal resolution through hierarchical quantization for improved compression.
- SNAC enhances latent representations and codebook utilization via a noise injection mechanism, and uses efficient depthwise convolutions and localized attention for a robust architecture.
- Empirical validation demonstrates SNAC's superior performance with objective metrics and subjective tests, achieving high quality at very low bitrates for speech and music.
Overview
"SNAC: Multi-Scale Neural Audio Codec" (2410.14411) introduces an extension of the standard Residual Vector Quantization (RVQ) framework by employing a multi-scale quantization scheme. The approach selectively adapts temporal resolution through a hierarchy of quantizers, thereby efficiently capturing both low-level transient details and high-level structural information inherent in audio signals. This design offers improved compression efficiency and reconstruction quality for both speech and music domains.
Key Contributions
- Multi-Scale RVQ Extension: The paper extends conventional RVQ by performing quantization at varying temporal resolutions. Downsampling via average pooling, followed by a codebook lookup in a reduced resolution domain, and subsequent nearest-neighbor upsampling enables the codec to reconcile information at different temporal granularities.
- Noise Injection Mechanism: A noise block incorporated after each upsampling layer injects input-dependent Gaussian noise using a linear mapping (i.e., x ← x + Linear(x) ⊙ ε with ε ~ N(0,1)). This strategy enhances stochastic diversity in the latent representations and improves codebook utilization.
- Efficient Module Architectures: The codec leverages depthwise convolutions within its generator network to reduce parameter count and stabilize training dynamics, avoiding pitfalls associated with traditional GAN-based vocoders.
- Local Windowed Attention: Integrating a localized self-attention mechanism at the lowest temporal resolution in both encoder and decoder facilitates capturing contextual cues in the latent space, optimizing representation of longer-term dependencies.
- Empirical Validation: Through extensive objective metrics (ViSQOL, SI-SDR, Mel Distance, STFT Distance) and subjective assessments (MUSHRA evaluation), SNAC demonstrates superior performance. Notably, for speech, it preserves high quality even below 1 kbit/s bitrate, and for music, it shows comparable quality to DAC at substantially lower bitrates.
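The noise injection step described above (x ← x + Linear(x) ⊙ ε with ε ~ N(0,1)) can be sketched in NumPy. The shapes, random weights, and function name below are illustrative stand-ins, not the paper's implementation; in SNAC the linear map is learned and the block sits after each decoder upsampling layer.

```python
import numpy as np

def noise_block(x, W, b, rng):
    """Input-dependent noise injection: x <- x + Linear(x) * eps, eps ~ N(0, 1).

    x: latent activations, shape (channels, time)
    W, b: parameters of a linear (1x1) map over channels (random stand-ins here)
    """
    eps = rng.standard_normal(x.shape)   # i.i.d. Gaussian noise, same shape as x
    scale = W @ x + b[:, None]           # input-dependent scale at each time step
    return x + scale * eps

rng = np.random.default_rng(0)
C, T = 8, 16
x = rng.standard_normal((C, T))
W = rng.standard_normal((C, C)) * 0.01   # small init so noise perturbs mildly
b = np.zeros(C)
y = noise_block(x, W, b, rng)
```

Because the noise amplitude is a function of the input rather than a fixed constant, the decoder can modulate stochasticity per time step, which is what the paper credits for the improved codebook utilization.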
Methodology
SNAC is built upon an RVQGAN baseline. Its architecture features a hierarchical quantization process:
- Downsampling: The residual sequence is progressively downsampled at each quantization stage via average pooling, reducing temporal resolution.
- Quantization: Each downsampled residual is then quantized through a vector quantization step, wherein a codebook is queried to represent the signal components.
- Upsampling: Post quantization, the signal is restored to its original temporal resolution using nearest-neighbor interpolation. This procedure ensures that quantized representations align with the temporal structure of the original input.
- Noise Block Integration: At each upsampling stage, a noise block injects input-dependent stochasticity, enhancing the flexibility and capacity of the decoder to model fine-grained acoustic nuances.
- Attention and Convolution Modules: Depthwise convolutions mitigate training instability by reducing model complexity while local windowed attention layers at the coarsest scales capture extended contextual dependencies.
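The downsample → quantize → upsample steps above can be sketched for a single stage in NumPy. Codebook size, feature dimension, and stride below are arbitrary illustrative choices, and the codebook is random rather than learned:

```python
import numpy as np

def quantize_stage(residual, codebook, stride):
    """One multi-scale quantizer stage in the style described for SNAC.

    residual: (time, dim) residual to quantize at this stage
    codebook: (K, dim) codebook (random stand-in for a learned one)
    stride:   temporal downsampling factor for this stage
    """
    T, D = residual.shape
    # 1) Downsample via average pooling over non-overlapping windows of `stride`.
    pooled = residual[: T - T % stride].reshape(-1, stride, D).mean(axis=1)
    # 2) Nearest-neighbor codebook lookup in the reduced-resolution domain.
    dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    # 3) Upsample back to the input resolution by nearest-neighbor repetition.
    quantized = np.repeat(codebook[codes], stride, axis=0)
    return quantized, codes

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
codebook = rng.standard_normal((32, 4))
quantized, codes = quantize_stage(x, codebook, stride=4)
next_residual = x - quantized  # fed to the next stage, at a different stride
```

Chaining stages with different strides yields the hierarchy: coarse stages spend few codes per second on slow structure, while fine stages (stride 1 reduces to plain RVQ) capture transients.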
The training regimen employs a GAN framework with a multi-period discriminator and a multi-scale STFT discriminator to enforce robust spectral and temporal fidelity. Optimization uses AdamW with an initial learning rate of 6e-4 and a per-iteration decay factor of λ = 0.999994, notably without any gradient clipping.
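The per-iteration decay implies a learning rate of 6e-4 · 0.999994^t at step t; a quick sketch of the resulting schedule (the function name is mine, not the paper's):

```python
def lr_at_step(step, base_lr=6e-4, decay=0.999994):
    """Exponential per-iteration learning-rate decay, as described for SNAC's AdamW setup."""
    return base_lr * decay ** step

lr_start = lr_at_step(0)
lr_100k = lr_at_step(100_000)  # roughly 0.55x the initial rate after 100k steps
```

The decay factor is close enough to 1 that the schedule is nearly flat over short runs, halving the rate only after roughly 115k iterations.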
Evaluation and Results
The extensive quantitative evaluations underscore SNAC's effectiveness:
- Objective Metrics: Performance improvements are quantified using standard metrics such as ViSQOL, SI-SDR, Mel Distance, and STFT Distance. The ablation studies clearly delineate the contributions of the noise block, multi-scale processing, and attention mechanisms.
- Subjective Listening Tests: MUSHRA-like evaluations indicate that SNAC outperforms traditional codecs in perceived audio quality, preserving important acoustic details even at low bitrates. Speech quality remains high below 1 kbit/s, while music reconstructions are comparable to DAC at significantly lower bitrates.
Conclusion
"SNAC: Multi-Scale Neural Audio Codec" provides a robust and efficient framework for neural audio compression by integrating multi-scale quantization, noise injection, and localized attention within an RVQ architecture. Its ability to adaptively process audio signals across multiple temporal scales results in superior compression performance and reconstruction fidelity, as validated through comprehensive objective and subjective assessments. The open-sourcing of the code and model weights further contributes to the reproducibility and potential adoption of this approach in audio generation and compression research.
In summary, SNAC represents a technically sound advancement in neural audio coding, efficiently balancing bitrate and quality across varying audio content through innovative multi-scale processing and robust architectural design.