SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound (2405.00233v2)
Abstract: Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, traditional codecs often operate at high bitrates or within narrow domains such as speech, and lack the semantic clues required for efficient language modeling. Addressing these challenges, we introduce SemantiCodec, a novel codec designed to compress audio into fewer than a hundred tokens per second across diverse audio types, including speech, general sound, and music, without compromising quality. SemantiCodec features a dual-encoder architecture: a semantic encoder using a self-supervised pre-trained Audio Masked Autoencoder (AudioMAE), discretized using k-means clustering on extensive audio data, and an acoustic encoder to capture the remaining details. The semantic and acoustic encoder outputs are used to reconstruct audio via a diffusion-model-based decoder. SemantiCodec is presented in three variants with token rates of 25, 50, and 100 per second, supporting a range of ultra-low bitrates between 0.31 kbps and 1.40 kbps. Experimental results demonstrate that SemantiCodec significantly outperforms the state-of-the-art Descript codec on reconstruction quality. Our results also suggest that SemantiCodec contains significantly richer semantic information than all evaluated state-of-the-art audio codecs, even at significantly lower bitrates. Our code and demos are available at https://haoheliu.github.io/SemantiCodec/.
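The abstract mentions two concrete mechanisms that are easy to make tangible: discretizing continuous encoder features into tokens with k-means, and the arithmetic that links token rate and codebook size to bitrate. The sketch below is a minimal illustration, not the authors' implementation; the feature dimensionality, codebook sizes, and the use of scikit-learn's KMeans are assumptions made purely for demonstration.

```python
# Minimal sketch (illustrative assumptions, not the SemantiCodec code):
# (1) k-means quantization of frame-level encoder features into discrete tokens,
# (2) bitrate implied by a given token rate and codebook size.
import math

import numpy as np
from sklearn.cluster import KMeans

# Stand-in for AudioMAE-style frame features: 5000 frames of 768-dim vectors
# (shapes are hypothetical, chosen only to keep the demo fast).
features = np.random.randn(5000, 768).astype(np.float32)

codebook_size = 256  # illustrative; the paper's codebook sizes may differ
kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(features)
tokens = kmeans.predict(features)  # one integer token per frame


def bitrate_kbps(tokens_per_second: float, vocab_size: int) -> float:
    """kbps needed for a flat (non-entropy-coded) encoding of each token."""
    return tokens_per_second * math.log2(vocab_size) / 1000.0


for rate in (25, 50, 100):  # the three token-rate variants named in the abstract
    print(f"{rate} tokens/s with a {codebook_size}-entry codebook: "
          f"{bitrate_kbps(rate, codebook_size):.2f} kbps")
```

With these placeholder numbers, 25 tokens per second and a 256-entry codebook cost 0.2 kbps, which shows how sub-kilobit bitrates follow directly from low token rates; the paper's reported 0.31 to 1.40 kbps range similarly depends on its actual token rates and codebook sizes.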