Speech-to-Speech LLMs

Updated 12 September 2025
  • Speech-to-Speech LLMs are neural architectures that directly process and generate spoken language by aligning acoustic features with text embeddings for robust semantic reasoning.
  • They employ stacked conformer encoders, attention mechanisms, and interleaved training techniques to optimize modality alignment and cross-modal generalization.
  • These models enhance voice assistants and dialogue systems by integrating advanced speech synthesis, multilingual capabilities, and rigorous benchmarking strategies.

Speech-to-Speech LLMs (SLLMs) are a class of neural LLMs that process, understand, and generate spoken language directly in both input and output modalities. Unlike traditional systems based on cascades of automatic speech recognition (ASR), natural language processing, and text-to-speech synthesis (TTS), SLLMs aim for end-to-end or tightly integrated architectures capable of robust semantic reasoning, acoustic analysis, and spoken language generation. These models are increasingly foundational for advanced voice assistants, automated spoken dialogue systems, and multimodal conversational AI.

1. Architectural Principles and Modality Alignment

Fundamental to SLLMs is the alignment of acoustic and linguistic representations to facilitate seamless information transfer and reduce information loss between modalities. Early efforts focused on the use of CTC-trained encoders with blank-filtering to compress frame-level speech representations, retaining only semantically relevant frames and aligning sequence length and semantics with text tokens (Wang et al., 2023). Speech input $X$ is encoded and filtered to produce $X_S$, which, after a self-attention-based Speech2Text Adapter, is mapped into the embedding space of a downstream LLM as $Y = \text{Adapter}(X_S)$, where the Adapter employs attention mechanisms to contextualize the filtered speech frames.
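
As a concrete illustration, the following is a minimal sketch of blank-filtering followed by a self-attention adapter, assuming a pretrained CTC encoder has already produced frame-level features and logits; the module names, dimensions, and toy inputs are illustrative, not taken from the cited system.

```python
import torch
import torch.nn as nn

class Speech2TextAdapter(nn.Module):
    """Projects blank-filtered speech frames into the LLM embedding space."""
    def __init__(self, speech_dim: int, llm_dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(speech_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, x_s: torch.Tensor) -> torch.Tensor:
        # Self-attention contextualizes the retained (non-blank) frames.
        ctx, _ = self.attn(x_s, x_s, x_s)
        return self.proj(ctx)                      # Y = Adapter(X_S)

def blank_filter(frames: torch.Tensor, ctc_logits: torch.Tensor, blank_id: int = 0) -> torch.Tensor:
    """Keep only frames whose CTC argmax is not the blank symbol."""
    keep = ctc_logits.argmax(dim=-1) != blank_id   # boolean mask over frames
    return frames[keep]

# Toy usage: 100 encoder frames of dim 256 mapped into a 1024-dim LLM space.
frames = torch.randn(100, 256)
ctc_logits = torch.randn(100, 32)                  # 32-symbol CTC vocabulary incl. blank
x_s = blank_filter(frames, ctc_logits).unsqueeze(0)   # add batch dimension
adapter = Speech2TextAdapter(speech_dim=256, llm_dim=1024)
y = adapter(x_s)                                   # ready to hand to the downstream LLM
```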

Recent SLLM systems often employ stacked conformer encoders for extracting acoustic embeddings at various temporal resolutions. These embeddings are then projected and prepended to the LLM's token sequence to enable direct speech-conditioned prediction (Fathullah et al., 2023). This mechanism is typically formalized as $X = [a_1, a_2, \ldots, a_k, t_1, t_2, \ldots, t_m]$, where the $a_i$ are audio embeddings and the $t_j$ are text embeddings. The direct prepending, coupled with shared embedding dimensions, enables the LLM to reason across speech tokens with minimal architectural modification.
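
A minimal sketch of this prepending step is shown below, assuming a projection from a 512-dimensional conformer output into a 1024-dimensional LLM embedding space; both dimensions, the vocabulary size, and the feeding of the result through `inputs_embeds` (as in Hugging Face-style decoders) are assumptions for illustration.

```python
import torch
import torch.nn as nn

llm_dim = 1024
audio_proj = nn.Linear(512, llm_dim)          # conformer output dim -> LLM embedding dim
text_embed = nn.Embedding(32000, llm_dim)     # stands in for the LLM's embedding table

audio_feats = torch.randn(1, 40, 512)         # k = 40 conformer frames
text_ids = torch.randint(0, 32000, (1, 12))   # m = 12 prompt tokens

a = audio_proj(audio_feats)                   # [a_1, ..., a_k]
t = text_embed(text_ids)                      # [t_1, ..., t_m]
x = torch.cat([a, t], dim=1)                  # X = [a_1 .. a_k, t_1 .. t_m]
# x (shape [1, k+m, llm_dim]) is passed to the decoder (e.g. via inputs_embeds),
# so the LLM attends jointly over speech and text positions with no architectural change.
```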

2. Speech Synthesis and Generation Techniques

Augmenting LLMs with speech synthesis capabilities leverages codec-based TTS models such as VALL-E in coupled or superposed configurations. There are three principal integration paradigms (Hao et al., 2023):

  • Direct Fine-tuning: The LLM is trained to predict sequences of speech codec tokens. Parameter-efficient LoRA fine-tuning is typically less effective for this due to the high density and unfamiliar distribution of codec tokens relative to text.
  • Superposed Layers: The LLM receives both text and acoustic tokens, producing continuous representations that are projected to match the TTS model's input space.
  • Coupled (Cascaded) Approach: The LLM serves solely as a semantic text encoder; its output is projected and provided as conditioning input for the autoregressive codec model (e.g., VALL-E), which handles all acoustic generation.

Experimental evidence indicates that the coupled (cascaded) strategy, which decouples acoustic generation from the LLM, yields the lowest error rates and the highest speaker-similarity and naturalness scores; e.g., a 10.9% average reduction in WER and consistent improvements in speaker similarity when LLaMA-7B is used as the text encoder (Hao et al., 2023).
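
The coupled configuration can be pictured with the minimal sketch below, in which the LLM's hidden states are projected and handed to a codec TTS decoder; `CoupledSpeechHead` and the stand-in decoder are hypothetical names, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CoupledSpeechHead(nn.Module):
    """Projects LLM hidden states into the conditioning space of a codec TTS model."""
    def __init__(self, llm_dim: int, tts_dim: int):
        super().__init__()
        self.proj = nn.Linear(llm_dim, tts_dim)

    def forward(self, llm_hidden: torch.Tensor, codec_tts: nn.Module) -> torch.Tensor:
        cond = self.proj(llm_hidden)   # semantic conditioning sequence
        return codec_tts(cond)         # the codec model handles all acoustic generation

# Toy usage: a linear layer stands in for a VALL-E-style autoregressive codec decoder.
dummy_codec_tts = nn.Linear(512, 1024)
head = CoupledSpeechHead(llm_dim=4096, tts_dim=512)
llm_hidden = torch.randn(1, 20, 4096)  # last-layer states for 20 text tokens
codec_out = head(llm_hidden, dummy_codec_tts)
```

Keeping acoustic generation entirely inside the codec model is what distinguishes this variant from the superposed configuration above, where the LLM itself consumes acoustic tokens.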

3. Multilingual and Multitask Learning

Scaling SLLMs across languages and tasks introduces additional modality alignment and data scarcity problems, particularly for "non-core" languages with limited paired speech-text resources. Solutions combine chain-of-thought (CoT) reasoning with cross-lingual transfer (Xue et al., 29 Apr 2025):

  • XS-CoT Framework: Speech is transcribed, translated into a core language (usually English), reasoning is performed in this core language, and then the response is translated back and synthesized in the target language. Explicit generation of four token types (instruction/response in both languages) enables the framework to retain reasoning quality in low-resource settings. Semi-implicit CoT compression governs inference latency by dynamically compressing chain-of-thought tokens without sacrificing logical structure.

This strategy yields up to a 45% improvement in GPT-4–scored response quality for non-core languages and more than a 50% reduction in token delay.
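
The XS-CoT control flow can be sketched as below, treating each stage as a separate call for clarity; the helper functions (`asr`, `translate`, `reason`, `tts`) and the dataclass are placeholders, since the framework itself generates the four token types explicitly rather than invoking separate components.

```python
from dataclasses import dataclass

@dataclass
class XSCoTOutput:
    src_instruction: str   # transcription in the target (non-core) language
    core_instruction: str  # translation into the core language (e.g., English)
    core_response: str     # chain-of-thought reasoning and answer in the core language
    src_response: str      # answer translated back, then sent to speech synthesis

def xs_cot(speech, asr, translate, reason, tts):
    src_instr = asr(speech)
    core_instr = translate(src_instr, to="core")
    core_resp = reason(core_instr)                 # CoT happens in the core language
    src_resp = translate(core_resp, to="source")
    audio = tts(src_resp)
    return XSCoTOutput(src_instr, core_instr, core_resp, src_resp), audio
```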

4. Learning Strategies, Generalization, and Interleaved Training

Data-efficient generalization and improved modality transfer remain critical. Key approaches include:

  • Multi-task Behavior Imitation (MTBI): The SLLM is trained to imitate the responses of a powerful text LLM given the same prompt and content, either in text or speech. Multi-task setups spanning ASR, content-based tasks, and constructed tasks drive robust cross-modal alignment (Xie et al., 24 May 2025).
  • Speech–Text Interleaving: During training, continuous segments of transcription are replaced with their synthesized speech equivalents, which helps the model learn to align and reason across modalities, improving both task and prompt generalization.
  • Scheduled Interleaved Training: Modality adaptation for S2ST is achieved by progressively interleaving more speech and less text in the input/output, with scheduled decay of the text ratio. CTC-based word alignments determine replacement boundaries at each step (Futami et al., 12 Jun 2025).

Such techniques significantly boost zero-shot generalization and task transfer capabilities while substantially reducing the need for large-scale supervised speech annotations.
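
As a rough illustration of scheduled interleaving, the sketch below swaps a decaying fraction of words from text to pre-extracted speech-token spans at word boundaries; the linear decay schedule, the token strings, and all names are assumptions for illustration, not the published recipe.

```python
import random

def text_ratio(step: int, total_steps: int, start: float = 1.0, end: float = 0.0) -> float:
    """Linearly decay the fraction of words left in text form."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def interleave(words, speech_spans, step, total_steps, seed=None):
    """words: list of text tokens; speech_spans: per-word speech-token lists (CTC-aligned)."""
    rng = random.Random(seed)
    r = text_ratio(step, total_steps)
    mixed = []
    for word, span in zip(words, speech_spans):
        mixed.extend([word] if rng.random() < r else span)
    return mixed

# Early in training nearly everything stays text; near total_steps it is mostly speech tokens.
words = ["the", "cat", "sat"]
spans = [["<s17>", "<s3>"], ["<s44>"], ["<s8>", "<s91>"]]
print(interleave(words, spans, step=0, total_steps=100, seed=0))
print(interleave(words, spans, step=95, total_steps=100, seed=0))
```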

5. Modeling the Acoustic-Semantic Gap, Evaluation, and Benchmarking

SLLMs traditionally suffer from performance degradation (termed "intelligence degradation") due to the acoustic-semantic gap inherent in using discrete speech tokens as training targets. The EchoX approach (three-stage training) addresses this by generating pseudo acoustic targets from intermediate semantic representations, then using an auxiliary Echo decoder and denoising adapter to align the latent space between semantic and acoustic representations (Zhang et al., 11 Sep 2025). The joint loss is $\mathcal{L} = \mathcal{L}_{\text{Echo}} + \lambda \cdot \mathcal{L}_{\text{Denoising}} + \mathcal{L}_{\text{S2T}}$. This mechanism enables SLLMs to preserve knowledge and reasoning capability while mitigating penalties for superficial acoustic mismatches.
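
For concreteness, the weighted combination of the three terms could look like the short sketch below; the component losses are supplied by surrounding training code, and the value of lambda is an assumption, not the paper's setting.

```python
import torch

def echox_style_loss(loss_echo: torch.Tensor,
                     loss_denoise: torch.Tensor,
                     loss_s2t: torch.Tensor,
                     lam: float = 0.5) -> torch.Tensor:
    """L = L_Echo + lambda * L_Denoising + L_S2T (lambda value is illustrative)."""
    return loss_echo + lam * loss_denoise + loss_s2t

# Toy usage with scalar stand-ins for the three losses.
total = echox_style_loss(torch.tensor(1.2, requires_grad=True),
                         torch.tensor(0.8, requires_grad=True),
                         torch.tensor(0.5, requires_grad=True))
total.backward()
```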

The S2SBench framework systematically quantifies degradation by comparing model perplexity on paired plausible/implausible samples in text vs. audio input scenarios (Fang et al., 20 May 2025). Two-stage training (freezing the LLM backbone before joint finetuning) demonstrably maintains higher "intelligence" and training stability.
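
A hedged sketch of this degradation probe is given below: it compares perplexities on paired plausible/implausible continuations under text and audio prompts and reports how much the margin shrinks; `score` is a placeholder for a model-specific call returning token log-probabilities, not an S2SBench API.

```python
import math

def perplexity(logprobs):
    """Per-token perplexity from a list of natural-log token probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def degradation_gap(score, prompt_text, prompt_audio, plausible, implausible):
    # A healthy model assigns the plausible continuation lower perplexity; a shrinking
    # audio-input margin relative to the text-input margin signals "intelligence degradation".
    text_margin = perplexity(score(prompt_text, implausible)) - perplexity(score(prompt_text, plausible))
    audio_margin = perplexity(score(prompt_audio, implausible)) - perplexity(score(prompt_audio, plausible))
    return text_margin - audio_margin
```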

For model evaluation, SageLM provides multi-aspect, explainable scoring of speech responses, using rationale-based supervision and synthetic speech preference data (SpeechFeedback) to reach over 82.7% human agreement rates (Ge et al., 28 Aug 2025). This approach surpasses cascaded (ASR-to-LLM) and SLM-based baselines in both semantic and acoustic dimensions.

6. Scaling Laws and Future Directions

Comprehensive scaling analyses confirm that while SLLMs follow established power-law scaling with respect to compute and data volume, their acquisition of syntactic and semantic proficiency progresses three orders of magnitude more slowly than in text-based LLMs (Cuervo et al., 31 Mar 2024): $L(C) \propto C^{\gamma}$ and $Q \propto C^{\gamma_q}$, where $L$ denotes loss, $Q$ a downstream metric, $C$ compute, and $\gamma_q$ is much smaller for speech.
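
As a worked illustration of reading off such an exponent, the snippet below fits a power law to synthetic (compute, loss) pairs via log-log regression; the numbers are invented solely to show the procedure.

```python
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (synthetic example)
loss = np.array([3.2, 2.9, 2.65, 2.45])        # held-out loss (synthetic example)

# A power law L(C) = a * C**gamma is a straight line in log-log space.
gamma, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)
print(f"fitted exponent gamma = {gamma:.3f}")  # small |gamma| means slow improvement with compute
```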

Accordingly, high-performing SLLMs trained under speech-only regimes require vastly more compute than text-centric LLMs. Research therefore increasingly emphasizes hybrid initialization, transfer strategies, and synthetic data generation (e.g., sTinyStories) to compensate for this inefficiency and to broaden semantic coverage.

Advances in simultaneous inference (SimulS2S-LLM) leverage boundary-aware speech prompts and incremental beam search for real-time S2ST with superior BLEU-latency trade-offs (Deng et al., 22 Apr 2025). Techniques such as dynamic compression training (Guo et al., 20 Jul 2025), iterative fusion of long sequences, and robust multi-modal retrieval architectures (Sun et al., 26 Jan 2025) are poised to further expand the capabilities and efficiency of SLLMs.

7. Evaluation Standards and Roadmapping

The SAGI Benchmark establishes a hierarchical evaluation standard, reflecting five levels from basic ASR to speech AGI (Artificial General Intelligence), assessing not only transcription but also paralinguistic cue extraction and domain-specific acoustic reasoning (Bu et al., 17 Oct 2024). This framework highlights major current limitations: inadequate paralinguistic integration, architectural underutilization of acoustic features, sensitivity to prompt formats, and backbone weaknesses for audio-like input streams.

Future work is directed toward:

  • Holistic integration of semantic, paralinguistic, and abstract acoustic knowledge;
  • Improved cross-modal fusion architectures beyond simple stacking;
  • Enhanced instruction-following robustness;
  • Efficient adaptation to streaming and simultaneous generation use cases;
  • Generalizable evaluative models (like SageLM) that simultaneously score and explain both semantic correctness and acoustic qualities.

In sum, SLLMs represent a convergence of advanced acoustic modeling, scalable language modeling, and multimodal reasoning, with significant ongoing progress in modality alignment, data efficiency, and evaluation standardization. Leading methodologies increasingly highlight the necessity of integrated learning objectives, dynamic training schedules, robust retrieval augmentation, and cross-modal architectural innovations for achieving human-level or superhuman performance in end-to-end spoken language understanding and generation.