Spoken Language Models (SLMs): Fundamentals and Advances
Spoken language models (SLMs) are machine learning models that process, understand, and generate human language through the speech modality, operating directly on acoustic signals rather than purely textual data. SLMs encompass a range of architectures and methodologies unified by the aim of modeling the distribution of spoken language, both for understanding (speech as input) and for generation (speech as output), often integrating non-linguistic features such as prosody, emotion, and speaker identity. The domain has evolved rapidly, with innovations in model architectures, tokenization, training strategies, and benchmarks, and is now central to the development of universal, instruction-following, and robust multimodal AI agents.
1. Fundamental Principles and Architectural Categories
SLMs can be broadly categorized based on their architecture and modality handling:
- Pure Speech LMs: These models learn from tokenized speech sequences without relying on textual resources, optimizing autoregressive next-token prediction over discrete units derived from the speech signal. Noteworthy early models include GSLM, AudioLM, and TWIST. The trend toward "textless" (speech-only) SLMs enables language modeling and generation in settings where text data or orthography is unavailable (Cuervo et al., 31 Mar 2024, Arora et al., 11 Apr 2025).
- Speech+Text LMs: These models are designed to jointly model both speech and text, often by interleaving token sequences from both modalities. They enable transfer and sharing of semantic knowledge from pre-trained textual LLMs, significantly improving language understanding and reducing data requirements. Examples include SpiRit-LM and Moshi. Recent scaling analyses demonstrate that interleaving modalities, especially when SLMs are initialized from large text LLMs, yields significantly better efficiency and performance than training on speech alone (Maimon et al., 3 Apr 2025).
- Speech-Aware Text LMs: These models integrate a speech encoder and modality adapter into a pre-trained LLM backbone, enabling text-based LLMs to accept speech as input. Architectures such as Qwen-Audio-Chat and SALMONN leverage pre-trained linguistic knowledge for rapid adaptation to speech, facilitating strong instruction-following and open-domain capabilities even in cross-modal tasks (Arora et al., 11 Apr 2025 , Lu et al., 27 Jun 2024 ).
All model categories commonly employ a discrete speech tokenizer, sequence model (often a Transformer), and a vocoder or unit-based speech synthesizer for waveform generation.
2. Speech Tokenization: Techniques and Impact
Speech tokenization is fundamental to SLMs, converting raw audio into sequences of discrete units suitable for language modeling. Techniques include:
- Frame-based and Pooled Segmentation: Continuous features (e.g., from HuBERT) are segmented into fixed-width (e.g., 80 ms) or variable-width intervals (e.g., phoneme, syllable, or word boundaries) and pooled prior to clustering (Kando et al., 23 May 2025); a sketch of this recipe appears at the end of this section. Efficient tokenization uses moderately coarse segmentation with a large vocabulary (e.g., 16,384 clusters over 80 ms segments), balancing information content with computational tractability.
- Clustering and Quantization: K-means and differentiable clustering methods assign tokens; larger codebooks retain expressive power with coarser units (Visser et al., 29 May 2025 , Chang et al., 31 Oct 2024 ).
- Speaker-Invariant Tokenization: Approaches like DC-Spin employ dual codebooks and speaker perturbation to ensure tokens are rich in phonetic content yet insensitive to speaker or channel variation. This enhances robustness, transferability, and resynthesis quality (Chang et al., 31 Oct 2024 ).
- Decoupled Tokenization: Recent advances decouple speech tokens into semantic and acoustic streams, aligning semantic tokens with text and modeling prosody/timbre separately. This modularity improves cross-modal alignment and synthesis quality (Fan et al., 14 Jun 2025 ).
Tokenization choices directly impact downstream performance and efficiency. Coarser units enable shorter sequences and lower runtime, which is advantageous for language modeling and sentence-level speech resynthesis; finer units are preferred for phonetic discrimination (Kando et al., 23 May 2025, Visser et al., 29 May 2025).
3. Training Strategies and Scaling Properties
SLMs are trained through large-scale self-supervised or supervised learning on extensive audio corpora, often augmented with synthetic data generated from TTS or LLMs:
- Next-token Prediction: The dominant objective is predicting the next speech token (or token pair for dual-channel models), analogous to text LLMs. In dual-channel dialogue, Next-Token-Pair Prediction (NTPP) jointly models parallel speaker streams, enabling rich modeling of turn-taking and overlap (Wang et al., 1 Jun 2025); a schematic sketch follows this list.
- Speech-Text Interleaving and Knowledge Transfer: Initializing SLMs from pre-trained text LLMs and interleaving speech/text tokens allows direct knowledge transfer, accelerating learning and reducing both compute and data requirements by an order of magnitude compared to "textless" SLMs (Maimon et al., 3 Apr 2025); see the interleaving sketch at the end of this section.
- Synthetic Data: Context-rich synthetic spoken datasets (e.g., sTinyStories) are shown to boost semantic performance, outperforming much larger natural speech corpora when designed to fit model context windows (Cuervo et al., 31 Mar 2024 ).
- Curriculum and Preference Optimization: Reinforcement learning from AI feedback (RLAIF) and Direct Preference Optimization (DPO) optimize for semantic quality by using LLM-assessed preference pairs, improving coherence in long-range generation (Lin et al., 4 Nov 2024 ).
- Mitigating Catastrophic Forgetting: Continual multi-task adaptation can degrade previously learned abilities. Experience replay—mixing prior stages' data during new task fine-tuning—is the most effective strategy to preserve general language capabilities alongside new skills (Hsiao et al., 23 May 2025 ).
Scaling analyses indicate SLMs follow power-law loss trends but require significantly more compute (up to 1000×) than text LLMs for the same downstream improvements. Interleaved SLMs, by contrast, scale far more efficiently and reach state-of-the-art performance with less data (Cuervo et al., 31 Mar 2024 , Maimon et al., 3 Apr 2025 ).
4. Evaluation Frameworks and Benchmarks
Evaluation of SLMs encompasses:
- Linguistic Benchmarks: ZeroSpeech sBLIMP (syntactic), sWUGGY (lexical), and StoryCloze (semantic) tasks assess the model's ability to distinguish correct from incorrect spoken utterances; a likelihood-comparison scoring sketch follows this list.
- Generation and Comprehension: Dynamic-SUPERB, AIR-Bench, and VoiceBench assess instruction following, speech comprehension, and task completion via both content and paralinguistic signals (Lu et al., 27 Jun 2024 , Lu et al., 30 Sep 2024 ).
- Conversational Quality: Metrics include turn-taking, overlap, response coherence, and human or LLM-based MOS (mean opinion score) evaluations, especially in dual-channel dialogue (Wang et al., 1 Jun 2025 ).
- Automatic Judging via ALLMs: Audio-aware LLMs (ALLMs, e.g., Gemini-2.5-pro, GPT-4o-audio) reliably evaluate SLMs for speaking style, naturalness, and realism, exhibiting inter-rater agreement comparable to or exceeding that among human raters (Chiang et al., 6 Jun 2025).
- Knowledge QA in Speech Format: VoxEval rigorously benchmarks knowledge reasoning via audio-only multiple choice Q&A, highlighting current SLM limitations in knowledge extraction and robustness to auditory variations (Cui et al., 9 Jan 2025 ).
5. Representation of Suprasegmental and Paralinguistic Features
SLMs encode not only lexical and syntactic content but also suprasegmental features (tone, stress, intonation) and paralinguistic cues (emotion, prosody, speaker traits):
- Encoding of Lexical Tone: Self-supervised SLMs can encode lexical tone distinctions even when trained on non-tonal language data. Representations learned via contrastive objectives embed rich suprasegmental information, and robustness improves when models are trained or fine-tuned on tonal languages (Shen et al., 25 Mar 2024); a probing sketch follows this list.
- Instruction and Reasoning Generalization: Alignment with text LLMs or leveraging descriptive alignment/captioning enables instruction-following, formatting specificity, and chain-of-thought reasoning from speech input, often without task-specific tuning (Lu et al., 27 Jun 2024 , Lu et al., 30 Sep 2024 ).
- Challenges in Paralinguistic Control: While SLMs can generate expressive speech, current models struggle with fine-grained style control, intra-utterance changes, and realistic multi-turn style adaptation (Chiang et al., 6 Jun 2025 ).
6. Current Limitations, Security, and Research Frontiers
Despite rapid progress, SLMs face several challenges:
- Semantic Gaps: SLMs lag text LLMs in semantic coherence because speech tokens are information-sparse, sequences are longer, and paralinguistic variability increases lexical ambiguity (Wang et al., 22 Dec 2024).
- Security and Robustness: SLMs are acutely vulnerable to audio-based adversarial jailbreak attacks, which bypass safety mechanisms via imperceptible perturbations. Post-hoc patching of network activations, particularly in the LLM component, provides a robust defense (up to 99% success rate) with negligible utility loss and no retraining (Djanibekov et al., 18 May 2025); a minimal patching sketch follows this list.
- Efficiency Bottlenecks: Sequence length and tokenization strategy strongly affect efficiency. Moderately coarse segmentation with large vocabularies enables significant runtime savings without accuracy loss (Kando et al., 23 May 2025 , Visser et al., 29 May 2025 ). Multi-token prediction and decoupled tokenization further accelerate decoding and improve alignment (Fan et al., 14 Jun 2025 ).
- Data Bottleneck: The scarcity of large, diverse, open speech dialogue corpora and instruction/role-based benchmarks limits expansion beyond English and narrow domains; recent resources such as J-CHAT in Japanese (Nakata et al., 22 Jul 2024) and RoleTriviaQA for speaker-aware QA (Fan et al., 14 Jun 2025) begin to address this gap.
- Evaluation and Standardization: Benchmark diversity and inconsistent reporting complicate comparative assessment; community-driven, standardized benchmarks remain a key need (Arora et al., 11 Apr 2025 ).
7. Key Research Directions and Applications
Ongoing and proposed efforts focus on:
- Open-source scaling and reproducibility: Release of models, data, and code lowers barriers for rigorous research and cross-laboratory comparison (Maimon et al., 3 Apr 2025 ).
- Hybrid and Interleaved Architectures: Leveraging text-based pre-training with speech-text interleaving for rapid scaling and transfer.
- Enhanced Speech Tokenization: Development of speaker-invariant, decoupled, and multi-resolution tokenizers to balance acuity and efficiency.
- Advances in Interactive Dialogue Modeling: Dual-channel and decoder-only approaches (e.g., Next-Token-Pair Prediction) for natural, real-time and speaker-independent multi-speaker dialogue (Wang et al., 1 Jun 2025 ).
- Automatic and Scalable Evaluation: Use of ALLMs for speaking style and paralinguistic evaluation, and continued improvement in automatic naturalness/realism scoring (Chiang et al., 6 Jun 2025 ).
- Security and Safety Mechanisms: Adversarial robustness mechanisms tailored for speech modality, addressing the expanded attack surface (Djanibekov et al., 18 May 2025 ).
- Multilingual, Multimodal, Low-resource Expansion: Extension of SLMs to a broader array of languages, conversational genres, and accessibility applications through proven methods in unsupervised training, synthetic data generation, and cross-modal alignment.
Conclusion
Spoken language models are a foundational technology for human-centric, robust, and instruction-following AI systems that operate over the full richness of spoken communication. The field has advanced from speech-only, self-supervised modeling to hybrid, deeply aligned architectures benefiting from LLM-derived linguistic priors, efficient tokenization, scalable training, and rigorous evaluation. Emerging best practices emphasize speaker invariance, information-efficient tokenization, modular generation, and the use of ALLMs for scalable evaluation. Limitations in semantic depth, security, fine-grained expressiveness, and evaluation persist, guiding current and future research toward more universal and human-like spoken language understanding and generation.