SoundStream: An End-to-End Neural Audio Codec

Published 7 Jul 2021 in cs.SD, cs.LG, and eess.AS | (2107.03312v1)

Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3kbps to 18kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24kHz sampling rate, SoundStream at 3kbps outperforms Opus at 12kbps and approaches EVS at 9.6kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.


Citations (606)


Summary

The paper introduces SoundStream, a neural audio codec that compresses speech and music using a convolutional encoder-decoder paired with a residual vector quantizer.
It is trained with a combination of adversarial and reconstruction losses, and quantizer dropout lets a single model operate across bitrates from 3 kbps to 18 kbps while remaining amenable to a low-latency, streamable implementation.
In subjective evaluations at a 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps; it also supports joint compression and enhancement and runs in real time on a smartphone CPU.

SoundStream: An Overview of a Neural Audio Codec

The paper introduces SoundStream, a neural audio codec designed to efficiently compress speech, music, and general audio at the bitrates normally targeted by speech-tailored codecs. The architecture consists of a fully convolutional encoder/decoder network and a residual vector quantizer, trained jointly end-to-end. Training draws on recent advances in text-to-speech and speech enhancement, combining adversarial and reconstruction losses so that high-quality audio can be generated from the quantized embeddings.

A noteworthy feature of SoundStream is that a single model can operate across bitrates from 3 kbps to 18 kbps, with negligible quality loss relative to models trained at a fixed bitrate. This is achieved through "quantizer dropout", a structured dropout applied to the quantizer layers during training. The model is also amenable to a low-latency implementation that supports streamable inference and runs in real time on a smartphone CPU.
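To make the link between quantizer layers and bitrate concrete, here is a minimal arithmetic sketch. The 24 kHz sampling rate and the 3 to 18 kbps range come from the abstract; the encoder stride of 320 samples per frame and the 1024-entry codebooks are assumptions made for illustration and are not stated in this summary. The helper `sample_active_quantizers` is a hypothetical name showing one plausible way quantizer dropout could pick how many layers to keep for a training example.

```python
import math
import random

SAMPLE_RATE_HZ = 24_000  # sampling rate used in the subjective evaluations
STRIDE = 320             # ASSUMPTION: input samples per embedding frame
CODEBOOK_SIZE = 1024     # ASSUMPTION: entries per quantizer layer (10 bits)


def bitrate_bps(num_quantizers: int) -> float:
    """Bitrate when each frame is coded with `num_quantizers` layers."""
    frames_per_second = SAMPLE_RATE_HZ / STRIDE
    bits_per_layer = math.log2(CODEBOOK_SIZE)
    return frames_per_second * num_quantizers * bits_per_layer


def sample_active_quantizers(max_quantizers: int, rng: random.Random) -> int:
    """Hypothetical quantizer-dropout step: keep only the first n layers,
    with n drawn uniformly from 1..max_quantizers (inclusive)."""
    return rng.randint(1, max_quantizers)


if __name__ == "__main__":
    for n in (4, 8, 24):
        print(n, "layers:", bitrate_bps(n), "bps")
    rng = random.Random(0)
    print("layers kept for one training example:", sample_active_quantizers(24, rng))
```

Under these assumptions, the number of active layers is the only knob that changes the bitrate, which is why dropping trailing layers at inference time yields a lower-bitrate stream from the same model.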

In subjective evaluations on audio sampled at 24 kHz, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, it can perform joint compression and enhancement, demonstrated through background noise suppression for speech, at either the encoder or the decoder side with no additional latency.

Technical Contributions

Neural Audio Codec: SoundStream combines a fully convolutional encoder/decoder with a residual vector quantizer, trained jointly with adversarial and reconstruction losses to preserve audio quality at low bitrates.
Residual Vector Quantizer: The quantizer is a learnable module trained together with the rest of the model, which the paper presents as improving the rate-distortion-complexity trade-off of the codec.
Bitrate Scalability: A single model covers multiple bitrates. This is realized through quantizer dropout, a structured dropout applied to the quantizer layers during training, with minimal quality impact compared with fixed-bitrate models.
Streaming and Real-time Capability: Designed to support low-latency, streamable inference, SoundStream operates in real-time on consumer-grade hardware like smartphone CPUs.
Joint Compression and Enhancement: SoundStream can simultaneously execute compression and audio enhancement (e.g., noise suppression) without incurring additional latency.
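The residual vector quantizer and its interaction with bitrate scalability can be illustrated with a short NumPy sketch. This is not the paper's implementation: it shows only the generic inference-time idea of residual vector quantization, where each layer quantizes what the previous layers left unexplained, and it omits how the codebooks are learned. The function names, the random codebooks, and the `num_active` parameter (standing in for using fewer layers at a lower bitrate) are all assumptions for illustration.

```python
import numpy as np


def rvq_encode(x, codebooks, num_active=None):
    """Greedy residual vector quantization of one embedding.

    x: array of shape (dim,)
    codebooks: list of arrays, each of shape (codebook_size, dim)
    num_active: use only the first `num_active` layers (None = all);
                a stand-in for running at a lower bitrate.
    Returns one codeword index per active layer.
    """
    residual = np.asarray(x, dtype=np.float64)
    indices = []
    for codebook in codebooks[:num_active]:
        # Pick the codeword closest (squared Euclidean) to the current residual.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(distances))
        indices.append(idx)
        # The next layer only has to code what this layer missed.
        residual = residual - codebook[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the embedding by summing the selected codewords."""
    out = np.zeros(codebooks[0].shape[1], dtype=np.float64)
    for codebook, idx in zip(codebooks, indices):
        out += codebook[idx]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # ASSUMPTION: random, untrained codebooks with shrinking scale per layer.
    books = [rng.normal(scale=0.5 ** i, size=(64, 8)) for i in range(4)]
    x = rng.normal(size=8)
    for n in range(1, 5):
        x_hat = rvq_decode(rvq_encode(x, books, num_active=n), books)
        print(n, "layers, squared error:", float(np.sum((x - x_hat) ** 2)))
```

Because decoding is a plain sum of codewords, a decoder that receives only the first few indices can still reconstruct a coarser embedding, which is the property that dropping quantizer layers relies on.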

Implications and Future Directions

The work has implications for both the theory and the practice of audio processing. It shows that combining adversarial and reconstruction losses, an approach drawn from recent text-to-speech and speech enhancement research, can deliver high-fidelity audio compression. SoundStream's demonstrated performance represents progress toward more efficient audio codecs, with the potential to influence a broad range of real-time audio applications, from telecommunications to media streaming services.

Looking forward, future developments may involve optimizing the codec's performance for even lower latencies and exploring its adaptability to various audio domains. Additionally, the potential synergy between SoundStream's compression capabilities and advanced audio enhancement techniques opens up avenues for further research into integrated audio processing systems that can adaptively manage varying audio environments and conditions.

This work contributes to the landscape of neural compression technologies, demonstrating an end-to-end system design that balances efficiency, quality, and flexibility in audio coding. As new neural architectures and quantization strategies emerge, SoundStream serves as a reference point for high-quality audio coding at low bitrates.
