- The paper introduces a dual-path semantic-acoustic audio tokenizer and detokenizer that efficiently compresses speech at ultra-low bitrates while preserving intelligibility.
- It pairs a multistage training strategy with an architecture built from convolution, transformer, and LSTM layers to achieve real-time, high-quality speech synthesis.
- Empirical evaluations show strong performance on metrics such as WER, PESQ, and STOI, establishing a new benchmark for audio tokenization in speech large language models.
LongCat-Audio-Codec: An Advanced Audio Tokenization Framework for Speech LLMs
Overview
The paper "LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech LLMs" (2510.15227) introduces LongCat-Audio-Codec, an innovative audio processing framework optimized for Speech LLMs. Leveraging a decoupled semantic-acoustic architecture and multistage training methodologies, it achieves exceptional low-bitrate compression while maintaining high speech intelligibility.
LongCat-Audio-Codec encodes speech at an ultra-low frame rate of 16.67 Hz, with a bitrate ranging from 0.43 kbps to 0.87 kbps, a practical trade-off between efficiency and quality. This article reviews the codec's architecture, design challenges, rationale, and empirical evaluation.
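To make the bitrate figures concrete, the arithmetic below assumes, purely for illustration, 13-bit codebooks (8192 entries each); under that assumption, two codebooks land at the low end of the reported range and four at the high end. The paper's exact codebook sizes and counts are not restated here.

```python
# Back-of-the-envelope bitrate check. The 13-bit codebook size is an assumption
# for illustration, not the paper's stated configuration.
frame_rate_hz = 16.67                  # token frames per second
bits_per_codebook = 13                 # assumed: 2**13 = 8192 entries per codebook

for n_codebooks in (2, 3, 4):          # assumed range of active codebooks
    kbps = frame_rate_hz * n_codebooks * bits_per_codebook / 1000
    print(f"{n_codebooks} codebooks -> {kbps:.2f} kbps")
# 2 codebooks -> 0.43 kbps, 4 codebooks -> 0.87 kbps, matching the reported range.
```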
Architectural Design
Tokenizer Architecture
LongCat-Audio-Codec employs a dual-path semantic-acoustic tokenizer to avoid the limitations of purely acoustic or purely semantic tokens. This configuration preserves both fine-grained acoustic features and high-level semantic content.
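As a rough illustration of the dual-path idea, the sketch below routes a shared front-end representation through a hypothetical semantic branch (transformer layers) and acoustic branch (convolution), each with its own codebook. All module names, layer sizes, and quantizer details are assumptions made for the sketch, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DualPathTokenizer(nn.Module):
    """Illustrative dual-path tokenizer producing semantic and acoustic token streams.

    Layer sizes, the single-conv front end, and the plain VQ are assumptions
    made for brevity, not the paper's actual design.
    """
    def __init__(self, dim=512, codebook_size=8192):
        super().__init__()
        # Shared front end: one large-stride conv stands in for a real conv stack
        # (stride 960 gives ~16.67 Hz frames from 16 kHz audio).
        self.frontend = nn.Conv1d(1, dim, kernel_size=960, stride=960)
        # Semantic path: transformer layers capture high-level content.
        self.semantic_path = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Acoustic path: lightweight convolution keeps fine-grained detail.
        self.acoustic_path = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        # Each path quantizes against its own codebook.
        self.semantic_codebook = nn.Embedding(codebook_size, dim)
        self.acoustic_codebook = nn.Embedding(codebook_size, dim)

    @staticmethod
    def quantize(x, codebook):
        # Squared Euclidean distance to every code, then nearest-code token ids.
        dists = (x.pow(2).sum(-1, keepdim=True)
                 - 2 * x @ codebook.weight.t()
                 + codebook.weight.pow(2).sum(-1))
        return dists.argmin(dim=-1)                      # (B, T)

    def forward(self, wav):                              # wav: (B, 1, samples)
        feats = self.frontend(wav).transpose(1, 2)       # (B, T, dim)
        sem = self.semantic_path(feats)
        aco = self.acoustic_path(feats.transpose(1, 2)).transpose(1, 2)
        return (self.quantize(sem, self.semantic_codebook),
                self.quantize(aco, self.acoustic_codebook))
```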
Decoder Architecture
The audio detokenizer is optimized for streaming and low-latency operations, using LSTM and convolution layers to ensure real-time, high-quality speech synthesis (Figure 3). This design significantly reduces computational complexity compared to diffusion-based approaches.
Figure 3: Architecture of detokenizer (decoder).
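A minimal sketch of such a streaming decoder follows, assuming token embeddings are summed across codebooks, passed through a unidirectional LSTM, and upsampled back to the waveform with transposed convolutions; the hidden size, upsampling factors, and 16 kHz output rate are illustrative assumptions. Returning the LSTM state is what allows decoding to proceed chunk by chunk.

```python
import torch
import torch.nn as nn

class StreamingDetokenizer(nn.Module):
    """Illustrative low-latency decoder: LSTM plus transposed convolutions.

    A unidirectional LSTM keeps the decoder causal; hidden size, upsampling
    factors, and the shared embedding table are assumptions for the sketch.
    """
    def __init__(self, codebook_size=8192, dim=512, upsample=(8, 6, 5, 4)):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        ups = []
        for r in upsample:  # 8*6*5*4 = 960x, i.e. ~16.67 Hz frames back to 16 kHz audio
            ups += [nn.ConvTranspose1d(dim, dim, kernel_size=2 * r, stride=r, padding=r // 2),
                    nn.GELU()]
        self.upsampler = nn.Sequential(*ups, nn.Conv1d(dim, 1, kernel_size=7, padding=3))

    def forward(self, tokens, state=None):
        # tokens: (B, T, n_codebooks) ids; embeddings are summed across codebooks
        # (a simplification -- separate tables per codebook would also work).
        x = self.embed(tokens).sum(dim=2)                # (B, T, dim)
        x, state = self.lstm(x, state)                   # carrying state enables chunked decoding
        wav = self.upsampler(x.transpose(1, 2))          # (B, 1, samples)
        return wav, state
```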
Design Challenges and Solutions
Text-Speech Multimodal Integration
The codec balances cross-modal understanding with efficient speech generation: semantic tokens support context-aware language modeling, while acoustic tokens retain the detail needed for synthesis.
Model Capacity Alignment
The architecture addresses the mismatch in token density between text and speech, using multiple codebooks to balance model capacity against information preservation. This keeps autoregressive generation efficient without sacrificing intelligibility.
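One standard way to realize a multiple-codebook trade-off is residual vector quantization (RVQ), where each extra codebook quantizes the residual left by the previous one, so bits per frame scale with the number of active codebooks. The sketch below is a generic RVQ, not the paper's exact quantizer.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Generic residual vector quantization sketch (not the paper's exact quantizer).

    Each stage quantizes the residual left by the previous stage, so every
    additional codebook adds bits per frame and reconstruction detail.
    """
    def __init__(self, num_codebooks=4, codebook_size=8192, dim=512):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)
        )

    def forward(self, x, n_active=None):                 # x: (B, T, dim) encoder features
        n_active = n_active or len(self.codebooks)       # fewer active codebooks -> lower bitrate
        residual, quantized, indices = x, torch.zeros_like(x), []
        for cb in self.codebooks[:n_active]:
            dists = (residual.pow(2).sum(-1, keepdim=True)
                     - 2 * residual @ cb.weight.t()
                     + cb.weight.pow(2).sum(-1))
            idx = dists.argmin(dim=-1)                   # (B, T) token ids for this codebook
            chosen = cb(idx)                             # (B, T, dim) selected code vectors
            quantized = quantized + chosen
            residual = residual - chosen
            indices.append(idx)
        return quantized, torch.stack(indices, dim=-1)   # tokens: (B, T, n_active)
```

Passing a smaller `n_active` trades reconstruction detail for bitrate, which is one way to realize the capacity/bitrate balance described above.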
Training Strategy
A multistage training strategy is employed:
- Encoder Pretraining: Exposes the model to diverse data to generalize its tokenization capacity.
- Decoder Training: Utilizes high-quality data to refine audio synthesis, improving fidelity and stability.
- Targeted Fine-Tuning (optional): Adapts decoder parameters to a specific speaker or constrained deployment scenario (see the sketch after this list).
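The staged schedule can be sketched as a controller over which parameters receive gradient updates at each stage. The helper names, step counts, data loaders, and losses below are placeholders; only the freeze/unfreeze pattern mirrors the three stages listed above.

```python
import torch

def set_trainable(module, flag):
    """Freeze or unfreeze every parameter of a module in place."""
    for p in module.parameters():
        p.requires_grad = flag

def run_stage(modules_to_train, data_loader, loss_fn, steps, lr=1e-4):
    """Run one training stage: only the listed modules receive gradient updates."""
    params = [p for m in modules_to_train for p in m.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), data_loader):
        opt.zero_grad()
        loss_fn(batch).backward()
        opt.step()

# Hypothetical schedule mirroring the three stages listed above; tokenizer,
# detokenizer, the loaders, and the losses are assumed to be defined elsewhere.
#
# Stage 1: encoder (tokenizer) pretraining on diverse data.
#   run_stage([tokenizer], diverse_loader, tokenizer_loss, steps=100_000)
# Stage 2: decoder training on high-quality data with the encoder frozen.
#   set_trainable(tokenizer, False)
#   run_stage([detokenizer], high_quality_loader, synthesis_loss, steps=50_000)
# Stage 3 (optional): targeted fine-tuning of the decoder for a speaker or scenario.
#   run_stage([detokenizer], target_speaker_loader, synthesis_loss, steps=5_000, lr=1e-5)
```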
Empirical Evaluation
Empirical evaluations demonstrate LongCat-Audio-Codec's strong performance across a range of metrics, including WER, PESQ, and STOI. The codec excels in low-bitrate scenarios, maintaining intelligibility while surpassing many traditional codecs in reconstruction quality when semantic information is included.
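These metrics are commonly computed with open-source packages; the snippet below assumes the `jiwer`, `pesq`, and `pystoi` packages, 16 kHz mono audio, and an external ASR transcript of the decoded speech for WER. It is a generic evaluation sketch, not the paper's exact protocol.

```python
from jiwer import wer                  # word error rate from transcripts
from pesq import pesq                  # ITU-T P.862 perceptual speech quality
from pystoi import stoi                # short-time objective intelligibility

def evaluate_clip(ref_wav, decoded_wav, ref_text, asr_text, sr=16000):
    """Score one reference/decoded pair.

    Inputs are assumed to be time-aligned 16 kHz mono numpy arrays plus the
    reference transcript and an ASR transcript of the decoded audio.
    """
    n = min(len(ref_wav), len(decoded_wav))              # trim to a common length
    ref, deg = ref_wav[:n], decoded_wav[:n]
    return {
        "WER": wer(ref_text, asr_text),                  # lower is better
        "PESQ": pesq(sr, ref, deg, "wb"),                # wideband mode; higher is better
        "STOI": stoi(ref, deg, sr, extended=False),      # higher is better
    }
```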
Figure 4: Speaker similarity improvement by Stage 2 and Stage 3.
Moreover, experiments highlight the potential of using fewer codebooks: even at constrained bitrates, the codec retains competitive intelligibility and acoustic metric scores (Figure 5).
Figure 5: Potential of few codebooks.
Conclusion
LongCat-Audio-Codec establishes a new benchmark in audio tokenization for speech LLMs, balancing token efficiency with reconstruction quality. While it is currently designed for speech, future iterations aim to relax input duration limitations and improve handling of music and sound effects. The modular training pipeline and adaptive codebook configuration give the framework flexibility and room to scale in evolving multimodal AI scenarios.