High Fidelity Neural Audio Compression
The paper "High Fidelity Neural Audio Compression" presents an advanced real-time neural audio codec, named EnCodec, which demonstrates superior performance in high-fidelity audio compression over a range of bitrates and sampling rates. The EnCodec system uses a neural network model with an encoder-decoder architecture enhanced by adversarial training and a novel loss balance mechanism. The evaluation results suggest significant improvements in audio quality across multiple domains such as speech, noisy-reverberant speech, and music.
Technical Overview
Neural Codec Architecture
EnCodec employs a convolutional encoder-decoder scheme with residual vector quantization (RVQ), which compresses the latent representation into a compact stream of discrete codes. The encoder maps raw audio to a latent representation, which is quantized and then reconstructed into an audio signal by the decoder. The system operates in real time and supports streaming, making it suitable for applications that require low-latency audio transfer.
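To make the quantization step concrete, the following is a minimal sketch of residual vector quantization: each stage quantizes whatever residual the previous stage left behind, and the decoder sums the selected codewords. The codebook sizes, latent dimensions, and random (untrained) codebooks are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of residual vector quantization (RVQ): each stage quantizes
# the residual left by the previous stage; the decoder sums the selected
# codewords. Codebook sizes and latent dimensions here are illustrative only.
import torch

def rvq_encode(latents, codebooks):
    """latents: (frames, dim); codebooks: list of (codebook_size, dim) tensors."""
    residual = latents
    indices = []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)   # (frames, codebook_size)
        idx = dists.argmin(dim=-1)                # nearest codeword per frame
        indices.append(idx)
        residual = residual - codebook[idx]       # pass the residual to the next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the latents by summing one codeword per stage."""
    return sum(cb[idx] for idx, cb in zip(indices, codebooks))

# Example: 75 latent frames (one second at the 24 kHz model's frame rate),
# 8 quantizer stages with 1024 entries each.
codebooks = [torch.randn(1024, 128) for _ in range(8)]
codes = rvq_encode(torch.randn(75, 128), codebooks)
reconstructed = rvq_decode(codes, codebooks)
```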
The encoder consists of 1D convolutions with residual units, followed by LSTM layers that model longer-range temporal dependencies; the decoder mirrors this structure with transposed convolutions to reconstruct the waveform. The architecture supports both monophonic and stereophonic audio at sampling rates up to 48 kHz, with target bitrates ranging from 1.5 kbps to 24 kbps.
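The supported bitrates follow directly from the latent frame rate and the number of RVQ codebooks retained: keeping more codebooks spends more bits per frame. A quick sanity check, assuming the 24 kHz model's overall encoder stride of 320 samples (75 latent frames per second) and 1024-entry codebooks:

```python
# Back-of-the-envelope bitrate arithmetic for the 24 kHz model, assuming an
# overall encoder stride of 320 samples and 1024-entry (10-bit) codebooks.
frame_rate_hz = 24_000 / 320        # 75 latent frames per second
bits_per_code = 10                  # log2(1024)

def bitrate_kbps(num_codebooks: int) -> float:
    return frame_rate_hz * bits_per_code * num_codebooks / 1000

print(bitrate_kbps(2))    # 1.5  -> lowest configuration
print(bitrate_kbps(32))   # 24.0 -> highest configuration
```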
Adversarial Training Strategy
To improve reconstruction quality, the authors integrate a multi-scale STFT discriminator (MS-STFTD) as an adversarial loss mechanism. This pushes the decoded audio toward being perceptually indistinguishable from the original and reduces artifacts common to neural codecs. Training also includes a relative feature matching loss, which compares the discriminator's intermediate features for real and decoded audio, stabilizing adversarial training and further improving reconstruction quality.
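As a rough illustration, the snippet below sketches one sub-network of a multi-scale STFT discriminator together with a relative feature matching loss. The layer widths, FFT sizes, and normalization are simplified assumptions, not the paper's exact architecture.

```python
# Illustrative single-scale STFT sub-discriminator plus a relative feature
# matching loss; the multi-scale discriminator runs several of these at
# different FFT sizes. Layer widths and window handling are simplifications.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STFTDiscriminator(nn.Module):
    def __init__(self, n_fft: int):
        super().__init__()
        self.n_fft = n_fft
        self.convs = nn.ModuleList([
            nn.Conv2d(2, 32, kernel_size=3, padding=1),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
        ])
        self.head = nn.Conv2d(32, 1, kernel_size=3, padding=1)  # per-bin logits

    def forward(self, wav):
        spec = torch.stft(wav, self.n_fft,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True)
        x = torch.stack([spec.real, spec.imag], dim=1)  # (batch, 2, freq, frames)
        features = []
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.2)
            features.append(x)
        return self.head(x), features

def relative_feature_matching_loss(real_feats, fake_feats):
    # L1 distance between intermediate features, normalized by the scale of
    # the real features and averaged over layers.
    return sum((r - f).abs().mean() / r.abs().mean().clamp(min=1e-8)
               for r, f in zip(real_feats, fake_feats)) / len(real_feats)

# The multi-scale discriminator is simply several of these at different resolutions.
ms_stft_discriminator = [STFTDiscriminator(n) for n in (512, 1024, 2048)]
```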
Loss Balancer Mechanism
A novel loss balancer mechanism is introduced to stabilize training: it decouples the effective weight of each loss from the natural scale of its gradient, so that the hyper-parameters directly express the intended relative contribution of each term. The balancer weighs the time-domain and frequency-domain reconstruction losses and the adversarial losses; the quantization commitment loss, which is not defined over the decoder output, is added separately. This makes training more robust and the loss weights more interpretable.
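A minimal sketch of the idea, assuming the gradients are taken with respect to the decoder output and omitting the paper's exponential moving average of gradient norms:

```python
# Gradient-based loss balancer (simplified sketch). Each loss's gradient with
# respect to the decoder output is rescaled to a common reference norm, so the
# weights express the fraction of the total gradient each loss contributes,
# regardless of that loss's raw scale.
import torch

def balanced_backward(losses: dict, weights: dict, decoder_output: torch.Tensor,
                      reference_norm: float = 1.0) -> None:
    total_weight = sum(weights.values())
    combined_grad = torch.zeros_like(decoder_output)
    for name, loss in losses.items():
        # Gradient of this loss taken only up to the decoder output.
        (g,) = torch.autograd.grad(loss, decoder_output, retain_graph=True)
        scale = (weights[name] / total_weight) * reference_norm
        combined_grad += scale * g / g.norm().clamp(min=1e-12)
    # Backpropagate the combined, rescaled gradient through the rest of the model.
    decoder_output.backward(combined_grad)
```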
Evaluation and Results
The authors present extensive subjective (MUSHRA) and objective (ViSQOL, SI-SNR) evaluations to demonstrate EnCodec's effectiveness. The evaluation sets cover a variety of audio types, including clean and noisy speech, music, and general audio. EnCodec consistently outperforms traditional codecs such as Opus and EVS, as well as neural codecs such as Lyra-v2, at equivalent bitrates.
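For reference, SI-SNR is the scale-invariant signal-to-noise ratio between the decoded and the original waveform; the function below follows the standard definition rather than code taken from the paper.

```python
# Scale-invariant signal-to-noise ratio (SI-SNR), one of the objective metrics
# reported alongside ViSQOL; standard textbook definition.
import torch

def si_snr(estimate: torch.Tensor, reference: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference so that rescaling the estimate
    # does not change the score.
    target = (estimate * reference).sum() / (reference.pow(2).sum() + eps) * reference
    noise = estimate - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))
```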
Performance Metrics
Results indicate that EnCodec achieves higher MUSHRA scores than Lyra-v2 at every evaluated bitrate, and that applying entropy coding to the quantized codes yields a substantial further bitrate reduction without compromising audio quality. Particularly notable is the 3 kbps setting, where EnCodec surpasses both Opus and Lyra-v2, demonstrating its effectiveness in very low bitrate scenarios.
Efficiency and Real-time Capability
The real-time factor (RTF) analysis shows that EnCodec runs roughly 10 times faster than real time on a single CPU core for 24 kHz audio, making it practical for latency-sensitive deployments. The 48 kHz stereo model, while slightly slower than real time, remains viable for non-live applications such as media streaming, where processing time is less critical.
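A sketch of how such a real-time factor could be measured with the released `encodec` package (the API names here, EncodecModel, set_target_bandwidth, encode, and decode, are assumed from the package's README): RTF is the audio duration divided by the wall-clock processing time, so values above 1 mean faster than real time.

```python
# Sketch of measuring the real-time factor (audio duration / processing time)
# for the 24 kHz model on a single CPU thread, using the released `encodec`
# package; API names assumed from its README.
import time
import torch
from encodec import EncodecModel

torch.set_num_threads(1)                       # single CPU core
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)                # target bitrate in kbps

duration_s = 10.0
wav = torch.randn(1, 1, int(24_000 * duration_s))   # (batch, channels, samples)

start = time.perf_counter()
with torch.no_grad():
    frames = model.encode(wav)
    _ = model.decode(frames)
elapsed = time.perf_counter() - start
print(f"real-time factor: {duration_s / elapsed:.1f}x")
```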
Implications and Future Directions
The findings suggest several practical and theoretical implications:
- Low-Latency Communication: EnCodec can significantly enhance the quality of voice and audio transmissions over limited bandwidth networks, facilitating more inclusive communication technologies.
- Streaming Services: Services like music streaming and video conferencing can leverage EnCodec to deliver high-fidelity audio even at reduced bitrates, ensuring broader accessibility and reducing bandwidth costs.
- Research Directions: Future work could explore larger or more expressive neural architectures, as well as stronger entropy models over the quantized codes, to further improve compression. Cross-disciplinary applications such as augmented-reality audio and smart-home devices would also benefit from these advances.
In conclusion, EnCodec represents a substantial advance in neural audio coding, setting a new benchmark for high-fidelity compression with real-time efficiency. Its potential applications span a wide range of fields, promising to make high-quality audio experiences more accessible and bandwidth-efficient.