- The paper introduces a novel focal modulation-based codec that compresses speech to bitrates as low as 0.16 kbps without sacrificing semantic or acoustic fidelity.
- It achieves competitive performance with a UTMOS of 4.05 and a dWER of 2.18 on LibriSpeech test-clean, outperforming multi-codebook models in resynthesis and voice conversion.
- Ablation studies confirm that focal modulation and binary spherical quantization are crucial for balancing compression efficiency with high-quality generative and discriminative speech tasks.
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Introduction and Motivation
The FocalCodec framework addresses the persistent challenge in neural speech coding: achieving high-quality speech reconstruction and robust semantic representation at ultra-low bitrates, while maintaining architectural simplicity for downstream generative and discriminative tasks. Existing neural codecs often rely on multi-codebook designs, which increase complexity and computational cost, or sacrifice either semantic or acoustic fidelity at low bitrates. FocalCodec introduces a hybrid codec architecture leveraging focal modulation networks and a single binary codebook, compressing speech to bitrates as low as 0.16 kbps. The design is motivated by the need for efficient tokenization that preserves both semantic and acoustic information, enabling high-quality resynthesis, voice conversion, and downstream modeling.
Architecture and Design Choices
FocalCodec is built upon the VQ-VAE paradigm but innovates with a compressor-quantizer-decompressor pipeline between the encoder and decoder. The encoder utilizes the first six layers of WavLM-large, capturing both semantic and acoustic features. This choice is substantiated by prior work demonstrating that lower layers of self-supervised models retain significant acoustic detail, which is essential for hybrid codecs.
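To make the encoder choice concrete, the sketch below extracts features from the sixth transformer layer of WavLM-large using the HuggingFace `transformers` implementation. This is a minimal illustration of the layer-selection idea, not FocalCodec's actual encoder code, and it omits the input normalization a real pipeline would apply via the matching feature extractor.

```python
# Sketch: features from the sixth transformer layer of WavLM-large,
# assuming the HuggingFace `transformers` implementation.
import torch
from transformers import WavLMModel

model = WavLMModel.from_pretrained(
    "microsoft/wavlm-large", output_hidden_states=True
)
model.eval()

waveform = torch.randn(1, 16000)  # placeholder: 1 s of 16 kHz audio
with torch.no_grad():
    outputs = model(waveform)
    # hidden_states[0] is the CNN feature projection; hidden_states[6]
    # is the sixth transformer layer's output (~50 frames per second).
    features = outputs.hidden_states[6]  # shape: (1, num_frames, 1024)
```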
Figure 1: FocalCodec architecture. The encoder extracts features containing both acoustic and semantic information. These features are then mapped to a low-dimensional space by the compressor, binary quantized, and projected back by the decompressor. The decoder resynthesizes the waveform from these features.
The compressor employs focal modulation blocks, replacing conventional self-attention with a mechanism that aggregates global context and modulates local interactions hierarchically. This approach introduces translation equivariance, explicit input dependency, and efficient long-range dependency modeling, which are beneficial for speech representation. The quantizer utilizes binary spherical quantization (BSQ), a lookup-free method that projects latent vectors onto a unit hypersphere and applies binary quantization, resulting in a codebook size of 2^L for latent dimension L. This enables granular quantization and high codebook utilization without the overhead of explicit codebook management.
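A minimal sketch of BSQ as described above, assuming PyTorch: latents are L2-normalized onto the unit hypersphere, binarized per dimension, and rescaled so the quantized vector stays on the sphere, with a straight-through estimator passing gradients to the compressor. Any auxiliary losses or entropy regularization from the original BSQ formulation are omitted.

```python
import torch
import torch.nn.functional as F

def bsq_quantize(z: torch.Tensor) -> torch.Tensor:
    """Binary spherical quantization (sketch).

    z: (..., L) latents. Returns one of 2^L unit-norm codes, with a
    straight-through estimator so gradients reach the compressor.
    """
    L = z.shape[-1]
    u = F.normalize(z, dim=-1)        # project onto the unit hypersphere
    q = torch.sign(u) / (L ** 0.5)    # binarize each dim; keep unit norm
    return u + (q - u).detach()       # straight-through estimator

def bsq_token_ids(z: torch.Tensor) -> torch.Tensor:
    """Map each latent to an integer token in [0, 2^L)."""
    bits = (z > 0).long()
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    return (bits * weights).sum(dim=-1)
```

Because the codebook is implicit in the sign pattern, there are no codebook embeddings to learn or keep alive, which is why high utilization comes essentially for free.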
The decompressor mirrors the compressor, reconstructing continuous encoder representations from quantized codes. The overall design is asymmetric, allocating more parameters to the encoder to prioritize representation quality while using a lightweight Vocos-based decoder for efficient waveform synthesis. Training is decoupled: the compressor, quantizer, and decompressor are trained to reconstruct encoder features, while the decoder is trained to resynthesize audio from these features using adversarial, reconstruction, and feature matching losses.
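The decoupled training can be summarized as two independent objectives. The sketch below is schematic: module and loss names are hypothetical, loss weights are placeholders, and the discriminator interface (logits plus feature maps) is an assumption modeled on common GAN vocoder setups.

```python
import torch
import torch.nn.functional as F

# Stage 1: train compressor + quantizer + decompressor to reconstruct
# the frozen encoder's features.
def stage1_loss(encoder_feats, compressor, quantizer, decompressor):
    z_q = quantizer(compressor(encoder_feats))   # e.g. BSQ, sketched above
    feats_hat = decompressor(z_q)
    return F.mse_loss(feats_hat, encoder_feats)

# Stage 2: train the decoder (vocoder) to resynthesize audio from the
# reconstructed features with reconstruction, adversarial, and
# feature-matching terms (lambda_* weights are placeholders).
def stage2_loss(wav, feats_hat, decoder, discriminator,
                lambda_adv=1.0, lambda_fm=1.0):
    wav_hat = decoder(feats_hat)
    rec = F.l1_loss(wav_hat, wav)
    fake_logits, fake_fmaps = discriminator(wav_hat)
    real_logits, real_fmaps = discriminator(wav)
    adv = F.mse_loss(fake_logits, torch.ones_like(fake_logits))  # LSGAN-style
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_fmaps, real_fmaps))
    return rec + lambda_adv * adv + lambda_fm * fm
```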
Experimental Results
Speech Resynthesis
FocalCodec demonstrates competitive performance in speech resynthesis across clean, multilingual, and noisy conditions. At 0.65 kbps (FocalCodec@50), it achieves a UTMOS of 4.05 and dWER of 2.18 on LibriSpeech test-clean, outperforming BigCodec and other state-of-the-art baselines in intelligibility and speaker similarity. The model generalizes well to multilingual speech and noisy environments, maintaining low dWER and high speaker similarity even at ultra-low bitrates.
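The reported bitrates follow directly from the token rate multiplied by the bits per token. Assuming a 13-bit binary code (an implicit 2^13 = 8192-entry codebook), which is consistent with the numbers above:

```python
# Bitrate = token rate (Hz) x bits per token / 1000.
# The 13-bit code size is an assumption consistent with the reported kbps.
BITS_PER_TOKEN = 13
for token_rate in (50, 25, 12.5):
    kbps = token_rate * BITS_PER_TOKEN / 1000
    print(f"FocalCodec@{token_rate:g}: {kbps:.4g} kbps")
# -> 0.65, 0.325 (~0.33), and 0.1625 (~0.16) kbps
```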
Subjective Evaluation
Figure 2: Subjective evaluation from 21 participants, averaged over 10 samples. Left: trade-off between mean opinion score and bitrate. Right: distribution of mean opinion scores. The FocalCodec@50 median is marginally lower than those of BigCodec, Stable Codec, and WavTokenizer, but user preference remains comparable when accounting for variability.
Subjective MUSHRA tests confirm that FocalCodec is Pareto-optimal up to 0.50 kbps, with user preference comparable to higher-bitrate codecs. The trade-off between perceived quality and bitrate is well-balanced, and FocalCodec maintains strong downstream task performance.
Voice Conversion
FocalCodec achieves the highest speaker similarity in one-shot voice conversion tasks, outperforming multi-codebook hybrid codecs such as SpeechTokenizer and Mimi, despite its single-codebook design. This indicates effective disentanglement of speaker and content information, a property typically associated with more complex architectures.
Downstream Tasks
On discriminative tasks (ASR, SI, SER), FocalCodec@50 ranks among the top models, with WER and error rates comparable to or better than multi-codebook and fine-tuned semantic codecs. On generative tasks (speech enhancement, separation, TTS), FocalCodec@50 and its lower-bitrate variants maintain high performance, with FocalCodec@12.5 excelling in TTS due to its compact token sequence, which simplifies autoregressive modeling.
Mel-Spectrogram Analysis
Figure 3: Reconstructed Mel-spectrograms from LibriSpeech (left) and Libri1Mix (right).
Spectrogram analysis reveals that FocalCodec preserves speech energy in relevant frequency bands, outperforming purely acoustic codecs in denoising and intelligibility, and avoiding artifacts introduced by other hybrid codecs.
Ablation and Comparative Studies
Ablation studies validate the architectural choices: replacing focal modulation with Conformer or AMP blocks, or substituting BSQ with vanilla LFQ, degrades performance in both objective and subjective metrics. The use of focal modulation and BSQ is critical for achieving the observed trade-off between compression and representation quality.
Comparisons with TS3-Codec, a transformer-only streaming codec, show that FocalCodec@50 surpasses TS3-Codec in all evaluated metrics, and FocalCodec@25 remains competitive at a significantly lower bitrate.
Implementation Considerations
FocalCodec is non-causal but can be streamed chunk-wise with acceptable performance at moderate chunk sizes. The model is trained on clean English speech at 16 kHz, with potential for further improvement through multilingual and higher-sample-rate training. The architecture is not optimized for music or environmental sounds, focusing exclusively on speech.
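Chunk-wise use of a non-causal model can be emulated by processing windows with extra context on each side and keeping only the central frames. The sketch below is illustrative, with a hypothetical `codec` callable, and is not FocalCodec's actual streaming implementation; the `context` size trades latency against how closely the output matches offline inference.

```python
import torch

def stream_resynthesize(wav, codec, chunk=16000, context=4000):
    """Illustrative chunk-wise inference for a non-causal codec.

    `codec` is a hypothetical waveform-to-waveform callable assumed to
    preserve sample length. Each chunk is padded with `context` samples
    on both sides; only the central part is kept, at the cost of
    `context` samples of lookahead latency.
    """
    out = []
    total = wav.shape[-1]
    for start in range(0, total, chunk):
        lo = max(0, start - context)
        hi = min(total, start + chunk + context)
        y = codec(wav[..., lo:hi])
        keep_lo = start - lo                           # drop left context
        keep_hi = keep_lo + min(chunk, total - start)  # drop right context
        out.append(y[..., keep_lo:keep_hi])
    return torch.cat(out, dim=-1)
```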
Implications and Future Directions
FocalCodec demonstrates that single-codebook, focal modulation-based architectures can achieve high-quality, low-bitrate speech coding without sacrificing semantic or acoustic fidelity. The efficient tokenization and robust representation learning make it well-suited for generative modeling, voice conversion, and downstream speech tasks. The use of BSQ and focal modulation introduces new avenues for scalable, efficient codec design.
Future work should address causality for real-time applications, expand training to broader domains and higher sample rates, and explore explicit mixture training for improved speech separation. The principles established by FocalCodec may inform the design of tokenizers for multimodal and cross-lingual speech models, as well as efficient codecs for edge deployment.
Conclusion
FocalCodec presents a significant advancement in low-bitrate neural speech coding, combining focal modulation networks and binary spherical quantization in a single-codebook architecture. It achieves competitive performance in speech resynthesis, voice conversion, and downstream tasks at ultra-low bitrates, with efficient token utilization and robust representation learning. The framework sets a new standard for codec simplicity and effectiveness, with broad implications for generative and discriminative speech modeling.