
LongCat-Audio-Codec: Ultra-Low Bitrate Tokenizer

Updated 20 October 2025
  • LongCat-Audio-Codec is an end-to-end audio codec with a decoupled semantic-acoustic architecture that enables robust tokenization and high-quality synthesis at ultra-low bitrates.
  • It employs a three-stage training paradigm with a Transformer-based semantic encoder and an acoustic encoder using Adaptive Grouped Residual Vector Quantization for precise feature extraction.
  • The design supports low-latency streaming synthesis and flexible bitrate control, balancing coding efficiency with superior perceptual quality for speech recognition and LLM integration.

LongCat-Audio-Codec is an industrial-grade audio tokenizer and detokenizer specifically designed for end-to-end speech LLMs. It is architected for robust semantic modeling, flexible acoustic feature extraction, and low-latency streaming synthesis while operating at ultra-low bitrates—between 0.43 kbps and 0.87 kbps with a frame rate of 16.67 Hz. This approach enables strong speech intelligibility and high-quality synthesis, balancing coding efficiency and decoding quality. The system is publicly available at https://github.com/meituan-longcat/LongCat-Audio-Codec (Zhao et al., 17 Oct 2025).

1. Decoupled Semantic–Acoustic Model Architecture

LongCat-Audio-Codec employs a decoupled architecture, with two parallel tokenization branches:

  • Semantic encoder: Processes fbank features through a two-layer Conv2d front-end into 60 ms frames, followed by bi-directional Transformer blocks. Hidden representations from the last layer of a CTC fine-tuned model are clustered via K-means into an 8192-entry codebook to yield semantic tokens. This separation ensures robust preservation of linguistic and contextual information while reducing dependency on low-level acoustics.
  • Acoustic encoder: A modified DAC architecture extracts fine-grained acoustic details using convolutional operations, ensuring high robustness on out-of-distribution data. Its output is quantized by an Adaptive Grouped Residual Vector Quantization (AGRVQ) module, which provides training stability and efficient search with large codebooks.

Both branches specialize, with semantic tokens encoding linguistic context and acoustic tokens capturing prosodic/timbre details. The decoder (detokenizer) fuses these representations for reconstruction.
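
The layout can be pictured as two encoders running in parallel over the same utterance and emitting separate token streams. The PyTorch sketch below is only an illustration under assumed module sizes; layer counts, dimensions, and the waveform-domain acoustic front-end are placeholders, not the released implementation:

```python
import torch
import torch.nn as nn

class DecoupledTokenizer(nn.Module):
    """Toy two-branch tokenizer: semantic tokens from fbank, acoustic latents from waveform."""

    def __init__(self, n_mels=80, d_model=512, n_semantic_codes=8192):
        super().__init__()
        # Semantic branch: Conv2d front-end + Transformer encoder (simplified).
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(32 * (n_mels // 4), d_model)
        self.semantic_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
        )
        # Stand-in for the 8192-entry K-means semantic codebook.
        self.semantic_codebook = nn.Parameter(torch.randn(n_semantic_codes, d_model))
        # Acoustic branch: DAC-style convolutional encoder over the raw waveform.
        self.acoustic_encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv1d(64, d_model, kernel_size=7, stride=4, padding=3),
        )

    def forward(self, fbank, wav):
        # fbank: (B, T_frames, n_mels); wav: (B, T_samples).
        x = self.frontend(fbank.unsqueeze(1))           # (B, 32, T', n_mels // 4)
        x = x.permute(0, 2, 1, 3).flatten(2)            # (B, T', 32 * n_mels // 4)
        h_sem = self.semantic_encoder(self.proj(x))     # contextual semantic features
        # Nearest-centroid lookup gives discrete semantic token IDs.
        dists = torch.cdist(h_sem, self.semantic_codebook.expand(h_sem.size(0), -1, -1))
        sem_tokens = dists.argmin(-1)                   # (B, T')
        h_ac = self.acoustic_encoder(wav.unsqueeze(1))  # continuous acoustic latents
        return sem_tokens, h_ac                         # h_ac is quantized by AGRVQ downstream

tokenizer = DecoupledTokenizer()
tokens, acoustic = tokenizer(torch.randn(1, 100, 80), torch.randn(1, 16000))
```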

2. Multistage Training Paradigm

LongCat-Audio-Codec adopts a three-stage training strategy:

  • Stage 1: Encoder pretraining. The encoders are trained on more than 500,000 hours of speech with a mel-spectrogram loss, and a discriminator is introduced once training has stabilized. The scale and diversity of the data expose the model to varied acoustic conditions and encourage robust latent representations that favor intelligibility.
  • Stage 2: Decoder pretraining. With the encoder and quantizer frozen, the decoder is trained for synthesis. The Train-More-Use-Less (TMUL) strategy is applied: training uses many codebooks, while deployment uses fewer (e.g., 2, 3, or 4). The decoder also upsamples from 16 kHz input to 24 kHz output, so each token corresponds to 1440 output samples (60 ms at 24 kHz).
  • Stage 3: Decoder fine-tuning (optional). Few-shot supervised fine-tuning on specific speaker sets enhances reconstruction of speaker identity and prosody, especially for targeted speech LLM applications.

This strategy allows progressive improvement in quality, robustness, and efficiency.
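
The Train-More-Use-Less idea from Stage 2 can be pictured as a residual vector-quantizer stack in which all codebooks are used during training but only the first few are kept at deployment. The snippet below is a minimal sketch; the codebook count and sizes are assumptions for illustration, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, dim=512, n_codebooks=8, codebook_size=1024):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_codebooks)
        )

    def forward(self, x, n_active=None):
        # x: (B, T, dim). n_active limits how many codebooks are used,
        # e.g. train with all of them, then deploy with n_active = 2, 3, or 4.
        n_active = n_active or len(self.codebooks)
        residual, quantized, codes = x, torch.zeros_like(x), []
        for cb in self.codebooks[:n_active]:
            dists = torch.cdist(residual, cb.weight.expand(x.size(0), -1, -1))
            idx = dists.argmin(-1)            # (B, T) token IDs for this codebook
            q = cb(idx)                       # nearest codeword
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)

rvq = ResidualVQ()
latents = torch.randn(2, 50, 512)
_, codes_train = rvq(latents)                # train-time: all codebooks active
_, codes_deploy = rvq(latents, n_active=3)   # deploy-time: use fewer codebooks
```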

3. Semantic Modeling Capabilities

Semantic modeling is conducted by a Transformer-based encoder operating on 60 ms frames of fbank features. Semantic representations are obtained from CTC-finetuned hidden layers—capturing rich context and linguistic information—before clustering into tokens. By decoupling semantic encoding to a dedicated codebook, interference from acoustic tokens is minimized, yielding tokens that are focused on maximizing language-related information. This improves downstream tasks such as speech recognition and language understanding in LLM-powered systems.
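
A minimal sketch of the clustering step, assuming hidden-state vectors have already been extracted from the CTC fine-tuned encoder (the feature array below is a random placeholder):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder for encoder hidden states; in practice these would be
# per-frame vectors taken from the CTC fine-tuned semantic encoder.
hidden_states = np.random.randn(100_000, 512).astype("float32")

# Fit an 8192-entry codebook (MiniBatchKMeans keeps memory manageable at this scale).
kmeans = MiniBatchKMeans(n_clusters=8192, batch_size=4096, random_state=0)
kmeans.fit(hidden_states)

# Tokenize new frames: each 60 ms frame maps to the index of its nearest centroid.
new_frames = np.random.randn(32, 512).astype("float32")
semantic_tokens = kmeans.predict(new_frames)   # integer IDs in [0, 8191]
```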

4. Acoustic Feature Extraction and Quantization

The acoustic encoder captures fine-grained acoustic properties such as timbre and prosody. It operates at the same frame rate as the semantic encoder (one frame per 60 ms), ensuring alignment between the two branches. The Adaptive Grouped Residual Vector Quantization (AGRVQ) module adaptively splits the latent acoustic output into groups, each quantized through its own projection layer and codebook. For example, two internal codebooks of 90 entries each yield an effective codebook of 90 × 90 = 8100 entries, and additional acoustic codebooks enable 2-, 3-, or 4-codebook configurations. This grouping improves quantization stability and search efficiency, permitting fine-grained reconstruction of acoustic detail while keeping the representation compact.
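
A minimal sketch of the grouped-quantization idea, where the acoustic latent is split into groups and each group is quantized against its own small codebook; group count, projection shapes, and codebook sizes here are illustrative assumptions, not the exact AGRVQ design:

```python
import torch
import torch.nn as nn

class GroupedVQ(nn.Module):
    def __init__(self, dim=512, n_groups=2, codebook_size=90):
        super().__init__()
        assert dim % n_groups == 0
        self.n_groups = n_groups
        # Independent projection and codebook per group, so two 90-entry
        # codebooks jointly cover 90 * 90 = 8100 combinations.
        self.projections = nn.ModuleList(
            nn.Linear(dim // n_groups, dim // n_groups) for _ in range(n_groups)
        )
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim // n_groups) for _ in range(n_groups)
        )

    def forward(self, x):
        # x: (B, T, dim) -> per-group token indices and the quantized vector.
        chunks = x.chunk(self.n_groups, dim=-1)
        codes, quantized = [], []
        for proj, cb, chunk in zip(self.projections, self.codebooks, chunks):
            z = proj(chunk)
            idx = torch.cdist(z, cb.weight.expand(z.size(0), -1, -1)).argmin(-1)
            codes.append(idx)
            quantized.append(cb(idx))
        return torch.stack(codes, dim=-1), torch.cat(quantized, dim=-1)
```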

5. Streaming Synthesis and Latency

The detokenizer is engineered for low-latency streaming synthesis, using both semantic and acoustic latents. It employs LSTM layers, convolution layers, and causal transposed convolutions, with a controlled three-frame lookahead (about 180 ms latency). This design supports real-time deployment and avoids the latency costs of diffusion-based decoders while maintaining training–inference consistency.
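
A back-of-envelope check of the numbers above; the frame duration and lookahead are taken from the description, and the arithmetic simply confirms the quoted latency:

```python
# 3-frame lookahead at 60 ms per frame gives roughly 180 ms of algorithmic delay.
frame_ms = 60          # one token per 60 ms (16.67 Hz)
lookahead_frames = 3   # frames the decoder waits for before emitting audio
samples_per_frame_out = int(24_000 * frame_ms / 1000)   # 1440 samples at 24 kHz

algorithmic_latency_ms = lookahead_frames * frame_ms
print(algorithmic_latency_ms, "ms lookahead,", samples_per_frame_out, "samples/frame")
# -> 180 ms lookahead, 1440 samples/frame
```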

6. Bitrate Control and Tokenization Dynamics

LongCat-Audio-Codec encodes speech at a frame rate of 16.67 Hz (1 token per 60 ms). Acoustic tokens use a variable number of codebooks for bitrate control:

| Codebook count | Bitrate (kbps) |
|----------------|----------------|
| 2              | ~0.43          |
| 3              | ~0.65          |
| 4              | ~0.87          |

Semantic tokens (single codebook, 8192 entries) operate independently from acoustic tokens. This allows flexible trade-off between coding efficiency and decoding quality.
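
These figures follow directly from the frame rate and codebook sizes. A quick check, assuming each token indexes a roughly 8192-entry (13-bit) codebook:

```python
import math

frame_rate_hz = 1000 / 60          # 16.67 tokens per second per codebook
bits_per_token = math.log2(8192)   # 13 bits per token

for n_codebooks in (2, 3, 4):
    kbps = n_codebooks * frame_rate_hz * bits_per_token / 1000
    print(f"{n_codebooks} codebooks -> {kbps:.2f} kbps")
# 2 codebooks -> 0.43 kbps
# 3 codebooks -> 0.65 kbps
# 4 codebooks -> 0.87 kbps
```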

7. Evaluation and Comparative Analysis

Evaluations cover speech intelligibility (Word Error Rate, WER), prosodic accuracy (Gross Pitch Error, GPE), perceptual quality (PESQ, STOI), and timbre/speaker similarity (SECS). Comparative results show that LongCat-Audio-Codec matches or surpasses both semantic codecs (Mimi, LLM-Codec, SemantiCodec) and acoustic codecs (EnCodec, DAC, TiCodec) at significantly lower bitrates. Increasing the number of acoustic codebooks yields monotonic improvements across metrics (lower WER and GPE, higher PESQ, STOI, and SECS), indicating a clear bitrate–quality trade-off. The system supports high-fidelity, intelligible speech synthesis and robust tokenization for downstream LLM tasks.
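
For reference, perceptual-quality metrics such as PESQ and STOI can be computed with the commonly used `pesq` and `pystoi` Python packages; the helper below is an illustrative sketch and not the paper's evaluation harness:

```python
import numpy as np
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def score_reconstruction(reference: np.ndarray, reconstructed: np.ndarray, fs: int = 16_000):
    """Return (PESQ, STOI) for a reference/reconstruction pair of mono float arrays at fs Hz."""
    pesq_score = pesq(fs, reference, reconstructed, "wb")            # wideband PESQ
    stoi_score = stoi(reference, reconstructed, fs, extended=False)  # short-time intelligibility
    return pesq_score, stoi_score

# Usage: pass real speech arrays (e.g. loaded with soundfile), both 16 kHz mono.
```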

8. Availability

LongCat-Audio-Codec's inference code and pre-trained checkpoints are openly available, supporting reproducibility and industrial deployment: https://github.com/meituan-longcat/LongCat-Audio-Codec


LongCat-Audio-Codec exemplifies a decoupled semantic–acoustic architecture paired with advanced quantization and training strategies to achieve ultra-low bitrate, high-quality speech tokenization for contemporary speech LLM platforms. Its streaming synthesis design and public availability position it as a benchmark solution in efficient neural audio coding (Zhao et al., 17 Oct 2025).
