- The paper introduces a single-decoder Transformer that interleaves semantic and acoustic tokens using a unified tokenizer, streamlining speech generation.
- It achieves state-of-the-art acoustic consistency and speaker preservation while maintaining competitive linguistic performance through autoregressive token prediction.
- Experiments reveal a trade-off between audio fidelity and linguistic coherence, highlighting the impact of model size and quantizer settings on performance.
Llama-Mimi: Speech LLMs with Interleaved Semantic and Acoustic Tokens
Introduction and Motivation
Llama-Mimi presents a streamlined approach to speech language modeling, using a unified tokenizer and a single Transformer decoder to jointly model interleaved semantic and acoustic tokens. This design contrasts with prior multi-stage or multi-transformer architectures, such as AudioLM and Moshi, which introduce architectural complexity and deployment challenges. The motivation is to simplify the speech generation pipeline while maintaining or improving performance in both acoustic fidelity and linguistic coherence.
Model Architecture
The core of Llama-Mimi is a single-decoder Transformer (Llama 3 backbone) that autoregressively predicts sequences of interleaved semantic and acoustic tokens. The unified tokenizer, based on Mimi, converts audio waveforms into residual vector quantizer (RVQ) tokens, with the first quantizer level semantically distilled from WavLM. Tokens are ordered per frame: the model predicts the semantic token first and conditions the subsequent acoustic tokens on it within each frame. The vocabulary is extended to include audio tokens and special delimiters, enabling seamless integration of text and audio modalities.
Figure 1: Model architecture of Llama-Mimi, illustrating the unified tokenizer and single-decoder Transformer for interleaved semantic and acoustic token modeling.
This architecture allows the model to attend to all past tokens, capturing both long-term dependencies and fine-grained acoustic variations. Training follows standard next-token prediction, with inference proceeding autoregressively until the end-of-audio token is generated.
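To make the interleaving concrete, the sketch below assembles one autoregressive token stream from per-frame RVQ codes. It is an illustration rather than the authors' code: the codebook size, the text-vocabulary offset, and the begin/end-of-audio delimiter IDs are assumptions; only the per-frame ordering (semantic level first, then acoustic levels) follows the paper.

```python
# Minimal sketch of per-frame token interleaving (illustrative, not the authors' code).
# Assumes Q = 4 RVQ levels per frame, with level 0 being the WavLM-distilled
# semantic token and levels 1-3 the acoustic tokens.
from typing import List

Q = 4                      # quantizer levels kept per frame
CODEBOOK_SIZE = 2048       # hypothetical codebook size per RVQ level
TEXT_VOCAB_SIZE = 128_256  # Llama 3 text vocabulary; audio IDs are appended after it
BOA_ID = TEXT_VOCAB_SIZE + Q * CODEBOOK_SIZE  # hypothetical begin-of-audio delimiter
EOA_ID = BOA_ID + 1                           # hypothetical end-of-audio delimiter


def audio_token_id(level: int, code: int) -> int:
    """Map an (RVQ level, codebook index) pair into the extended vocabulary."""
    return TEXT_VOCAB_SIZE + level * CODEBOOK_SIZE + code


def interleave_frames(frames: List[List[int]]) -> List[int]:
    """Flatten per-frame RVQ codes into a single autoregressive stream.

    Each frame contributes its semantic token (level 0) first, then its
    acoustic tokens (levels 1..Q-1), so acoustic codes are always predicted
    conditioned on the semantic code of the same frame.
    """
    seq = [BOA_ID]
    for codes in frames:  # codes[q] = codebook index at RVQ level q
        assert len(codes) == Q
        for level, code in enumerate(codes):
            seq.append(audio_token_id(level, code))
    seq.append(EOA_ID)
    return seq


# Two frames of dummy codes -> 1 BOA + 2 * 4 audio tokens + 1 EOA = 10 tokens.
print(len(interleave_frames([[5, 17, 3, 99], [8, 120, 64, 7]])))
```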
Experimental Setup
Llama-Mimi is instantiated in two sizes: 1.3B and 8B parameters. Training uses large-scale English speech corpora (Libri-Light, The People's Speech, VoxPopuli, Emilia), totaling approximately 240k hours. Audio is tokenized at 12.5 Hz with Q=4 quantizers, yielding 50 tokens/sec. Models are initialized from pretrained Llama 3 checkpoints, with Mimi parameters frozen. Training is performed on 32 NVIDIA H200 GPUs using FSDP, with a global batch size of 1,024 and a maximum sequence length of 1,024 tokens.
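The stated numbers imply a simple token budget, worked out below from the 12.5 Hz frame rate, Q=4 quantizers, and 1,024-token maximum sequence length (special tokens ignored for simplicity):

```python
# Back-of-the-envelope token budget for the stated training setup.
FRAME_RATE_HZ = 12.5  # Mimi frames per second
Q = 4                 # RVQ levels kept per frame
MAX_SEQ_LEN = 1024    # maximum training sequence length in tokens

tokens_per_second = FRAME_RATE_HZ * Q          # 12.5 * 4 = 50 tokens/sec
seconds_per_context = MAX_SEQ_LEN / tokens_per_second

print(f"{tokens_per_second:.0f} tokens/sec")       # 50 tokens/sec
print(f"{seconds_per_context:.1f} s per context")  # ~20.5 s of audio per window
```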
Evaluation Methodology
Likelihood-Based Evaluation
Acoustic consistency and alignment are assessed using SALMon, which measures the model's ability to assign higher likelihoods to natural samples than to those with unnatural acoustic changes. Semantic knowledge is evaluated with sWUGGY, sBLIMP, and sTopic-StoryCloze, focusing on word validity, grammaticality, and continuation coherence. For these semantic benchmarks, likelihoods (perplexity) are computed over semantic tokens only; SALMon is the exception.
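A likelihood-based benchmark of this kind reduces to pairwise comparisons: the model should assign a higher likelihood to the natural sample than to its perturbed counterpart. The sketch below assumes a Hugging Face-style causal LM that returns `.logits`; it is a generic illustration, not the benchmarks' reference implementation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sequence_log_likelihood(model, token_ids: torch.Tensor) -> float:
    """Sum of next-token log-probabilities for one tokenized audio sample.

    `model` is any causal LM returning logits of shape (1, T, vocab);
    `token_ids` has shape (1, T). Length normalization is omitted, which is
    fine for pairwise comparisons between samples of (nearly) equal length.
    """
    logits = model(token_ids).logits[:, :-1]   # logits at position t predict token t+1
    targets = token_ids[:, 1:]
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum().item()


def pairwise_accuracy(model, pairs) -> float:
    """Fraction of (natural, perturbed) pairs where the natural sample scores higher."""
    wins = sum(
        sequence_log_likelihood(model, natural) > sequence_log_likelihood(model, perturbed)
        for natural, perturbed in pairs
    )
    return wins / len(pairs)
```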
Generation-Based Evaluation
Speech generation is evaluated by prompting the model with a short audio clip and letting it generate a continuation. Speaker consistency is measured via cosine similarity of WavLM-based speaker embeddings. Spoken content quality is assessed using an LLM-as-a-Judge framework, in which GPT-4o rates transcribed continuations for relevance, coherence, fluency, and informativeness. This approach addresses the instability of perplexity-based metrics, which are sensitive to sequence length and sampling.
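Speaker consistency boils down to a cosine similarity between speaker embeddings of the prompt and its generated continuation. The helper below assumes the embeddings have already been extracted with a WavLM-based speaker-verification model and shows only the comparison step:

```python
import torch
import torch.nn.functional as F


def speaker_similarity(prompt_emb: torch.Tensor, continuation_emb: torch.Tensor) -> float:
    """Cosine similarity between two 1-D speaker embeddings.

    Higher values indicate that the continuation better preserves the
    prompt speaker's voice characteristics.
    """
    return F.cosine_similarity(
        prompt_emb.unsqueeze(0), continuation_emb.unsqueeze(0)
    ).item()
```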
Results and Analysis
Llama-Mimi achieves state-of-the-art results in acoustic consistency, outperforming baselines such as GSLM, TWIST, and Flow-SLM. The model's ability to attend to all past tokens enables superior modeling of subtle acoustic properties and speaker identity. On semantic tasks, Llama-Mimi performs slightly below SSL-based models, which is attributed to its longer sequence lengths making global information harder to capture. Increasing model size improves semantic performance but has minimal impact on acoustic metrics.
Speaker Consistency and Content Quality
Llama-Mimi demonstrates strong speaker preservation, with higher similarity scores than baselines. LLM-as-a-Judge scores for spoken content quality are more reliable than perplexity, with larger models achieving better ratings and ground-truth references scoring highest. This validates the effectiveness of the LLM-based evaluation framework.
Qualitative Generation
Qualitative analysis shows that the 8B model produces more natural and coherent continuations than the 1.3B variant, indicating that model capacity is critical for handling long sequences and maintaining coherence.
Quantizer Ablation
Increasing the number of quantizers (Q) improves audio quality and speaker similarity but degrades spoken content quality. This trade-off highlights the challenge of balancing acoustic fidelity with linguistic coherence, as longer token sequences strain the model's ability to maintain global context.
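The context-coverage side of this trade-off is easy to quantify: at a fixed frame rate and context window, increasing Q proportionally shortens the audio horizon the model can attend over. The Q values below are illustrative, not necessarily the paper's exact ablation settings:

```python
# Audio horizon of a fixed 1,024-token context as the quantizer count Q grows.
FRAME_RATE_HZ = 12.5
MAX_SEQ_LEN = 1024

for q in (4, 8, 16):  # illustrative quantizer counts
    tokens_per_second = FRAME_RATE_HZ * q
    print(f"Q={q:2d}: {tokens_per_second:5.0f} tokens/sec, "
          f"{MAX_SEQ_LEN / tokens_per_second:5.1f} s per context window")
# Q= 4:    50 tokens/sec,  20.5 s
# Q= 8:   100 tokens/sec,  10.2 s
# Q=16:   200 tokens/sec,   5.1 s
```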
Implementation Considerations
- Computational Requirements: Training Llama-Mimi requires significant resources (32 H200 GPUs, large batch sizes), but the single-decoder architecture simplifies deployment compared to multi-stage pipelines.
- Tokenization Strategy: Unified tokenization enables joint modeling of semantic and acoustic information, reducing the need for specialized tokenizers and facilitating end-to-end training.
- Sequence Length: Longer token sequences, especially with higher quantizer counts, can hinder linguistic performance. Careful tuning of quantizer number and frame rate is necessary to balance fidelity and coherence.
- Evaluation: LLM-as-a-Judge provides a robust alternative to perplexity for spoken content quality, especially in open-ended generation scenarios; a minimal judging sketch follows this list.
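A minimal judging loop might look like the following sketch, using the OpenAI Python client to score one transcribed continuation against its prompt; the rubric wording and 1-5 scale are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical LLM-as-a-Judge call: GPT-4o rates a transcribed continuation
# against the transcribed prompt. Rubric and scale are illustrative.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate the continuation of the spoken prompt on a 1-5 scale for each of: "
    "relevance, coherence, fluency, informativeness. "
    "Reply with four comma-separated integers."
)


def judge_continuation(prompt_transcript: str, continuation_transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": f"Prompt: {prompt_transcript}\n"
                           f"Continuation: {continuation_transcript}",
            },
        ],
    )
    return response.choices[0].message.content
```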
Implications and Future Directions
Llama-Mimi demonstrates that a single-decoder Transformer with unified tokenization can achieve competitive or superior performance in speech generation tasks, simplifying the architecture and deployment. The observed trade-off between acoustic fidelity and linguistic coherence suggests future work in dynamic token allocation, adaptive quantization, or hierarchical attention mechanisms to mitigate sequence length challenges. The LLM-as-a-Judge evaluation framework may become standard for assessing generative spoken content, given its reliability and alignment with human judgment.
Conclusion
Llama-Mimi advances speech language modeling by unifying semantic and acoustic token processing within a single Transformer decoder. The model achieves state-of-the-art acoustic consistency and speaker preservation, with a clear trade-off between audio fidelity and long-term coherence. The adoption of LLM-as-a-Judge for spoken content evaluation sets a precedent for future benchmarking. This work paves the way for more efficient, scalable, and robust speech generation systems, with potential extensions to multilingual, multimodal, and streaming applications.