
SoundStream: An End-to-End Neural Audio Codec (2107.03312v1)

Published 7 Jul 2021 in cs.SD, cs.LG, and eess.AS

Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3kbps to 18kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24kHz sampling rate, SoundStream at 3kbps outperforms Opus at 12kbps and approaches EVS at 9.6kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.

SoundStream: An Overview of a Neural Audio Codec

The paper introduces SoundStream, a neural audio codec designed to efficiently compress a variety of audio types, including speech and music, at lower bitrates than traditional speech-specific codecs. The principal architecture of SoundStream consists of a fully convolutional encoder-decoder network paired with a residual vector quantizer, all trained jointly in an end-to-end manner. The training process utilizes advancements in text-to-speech and speech enhancement, incorporating both adversarial and reconstruction losses, to generate high-quality audio content from quantized embeddings.
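The residual vector quantizer at the heart of this architecture can be sketched as a stack of codebooks, each greedily quantizing the residual left by the previous stage. The function name, shapes, and random codebooks below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def residual_vector_quantize(embedding, codebooks):
    """Quantize an embedding with a cascade of residual codebooks.

    Hypothetical sketch: each stage picks the codeword nearest to the
    residual of the previous stage, so later stages refine earlier ones.
    """
    residual = embedding.astype(np.float64)
    codes, quantized = [], np.zeros_like(residual)
    for codebook in codebooks:                  # shape: (codebook_size, dim)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))             # nearest codeword to residual
        codes.append(idx)
        quantized += codebook[idx]
        residual = residual - codebook[idx]     # pass residual to next stage
    return codes, quantized

rng = np.random.default_rng(0)
books = [rng.standard_normal((1024, 8)) for _ in range(4)]  # 4 stages, 10 bits each
x = rng.standard_normal(8)
codes, x_hat = residual_vector_quantize(x, books)
```

Transmitting only the list of indices (one per stage) rather than the embedding itself is what makes the quantizer a compression step; the decoder reconstructs `x_hat` by summing the indexed codewords.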

A noteworthy feature of SoundStream is its adaptability across a range of bitrates, from 3 kbps to 18 kbps, with minimal quality degradation relative to fixed-bitrate models. This is achieved through a "quantizer dropout" strategy applied during training, which facilitates a flexible bitrate handling capability within a single model framework. SoundStream is designed for low-latency implementation, enabling real-time streaming on smartphone CPUs.
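The bitrate arithmetic follows from figures reported in the paper: 24 kHz audio with an encoder striding factor of 320 yields 75 embeddings per second, and each 1024-entry codebook contributes 10 bits per embedding. The sketch below shows that arithmetic plus the quantizer-dropout idea; the helper names are hypothetical:

```python
import random

# Derived from figures reported in the paper: 24 kHz input and an encoder
# striding factor of 320 give 75 embeddings per second; each 1024-entry
# codebook contributes log2(1024) = 10 bits per embedding.
FRAME_RATE_HZ = 75
BITS_PER_QUANTIZER = 10

def bitrate_bps(num_quantizers):
    """Bitrate implied by the number of active residual quantizer stages."""
    return FRAME_RATE_HZ * BITS_PER_QUANTIZER * num_quantizers

def sample_active_stages(max_stages=24, rng=random):
    """Quantizer-dropout sketch (hypothetical helper name): each training
    step keeps only the first nq quantizer stages, so a single model
    learns to operate at every intermediate bitrate."""
    return rng.randint(1, max_stages)
```

Four active stages give the 3 kbps operating point (75 × 10 × 4 = 3000 bps) and twenty-four give 18 kbps, matching the range quoted above.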

Empirical evaluations highlight SoundStream's performance superiority; at 3 kbps, it surpasses the Opus codec operating at 12 kbps and approaches the performance of the EVS codec at 9.6 kbps. Moreover, it can execute joint compression and enhancement tasks, such as noise suppression, without additional latency.

Technical Contributions

  1. Neural Audio Codec: SoundStream integrates a convolutional encoder-decoder with a residual vector quantizer, trained with adversarial and reconstruction losses to preserve audio quality at reduced bitrates.
  2. Residual Vector Quantizer: A learnable quantization module that improves the rate-distortion-complexity trade-off essential to codec performance.
  3. Bitrate Scalability: SoundStream's architecture allows for bitrate scalability within a single model, realized through quantizer dropout, which provides training flexibility across multiple bitrates with minimal quality impact.
  4. Streaming and Real-time Capability: Designed to support low-latency, streamable inference, SoundStream operates in real-time on consumer-grade hardware like smartphone CPUs.
  5. Joint Compression and Enhancement: SoundStream can simultaneously execute compression and audio enhancement (e.g., noise suppression) without incurring additional latency.
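The streaming capability in point 4 rests on causal convolutions: padding only on the past side of the signal means each output sample depends on past inputs alone, so latency is set by the encoder stride rather than the full receptive field. A minimal sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: zero-pad only on the left, so output
    sample i depends on inputs x[0..i] and never on future samples."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # y[i] = sum_j kernel[j] * x[i - j]  (missing past samples treated as 0)
    return np.array([padded[i:i + k] @ kernel[::-1] for i in range(len(x))])

# Impulse response: a causal filter simply replays its kernel forward in time.
y = causal_conv1d(np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 2.0, 3.0]))
```

Because no output peeks at future input, the model can emit audio frame by frame as samples arrive, which is what makes real-time streaming on a smartphone CPU feasible.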

Implications and Future Directions

The research presented in this paper carries significant implications for both the theory and practice of audio processing. Combining adversarial and reconstruction losses during training marks a path toward high-fidelity neural audio compression. SoundStream's demonstrated performance represents progress toward a new generation of efficient audio codecs, with the potential to influence a broad spectrum of real-time audio applications, from telecommunications to media streaming services.

Looking forward, future developments may involve optimizing the codec's performance for even lower latencies and exploring its adaptability to various audio domains. Additionally, the potential synergy between SoundStream's compression capabilities and advanced audio enhancement techniques opens up avenues for further research into integrated audio processing systems that can adaptively manage varying audio environments and conditions.

This work contributes to the landscape of neural compression technologies, demonstrating a mature approach to end-to-end system design that balances efficiency, quality, and flexibility in audio coding. As new advancements in neural architectures and quantization strategies emerge, SoundStream sets a benchmark for high-quality audio coding at reduced bitrates.

Authors (5)
  1. Neil Zeghidour (39 papers)
  2. Alejandro Luebs (6 papers)
  3. Ahmed Omran (20 papers)
  4. Jan Skoglund (23 papers)
  5. Marco Tagliasacchi (37 papers)
Citations (606)