
EchoX: Speech-to-Speech LLM Framework

Updated 12 September 2025
  • EchoX is a speech-to-speech large language model training framework that integrates acoustic and semantic learning using a novel Echo training paradigm.
  • Its multi-stage pipeline—combining speech-to-text, text-to-codec, and Echo training—aligns semantic and acoustic representations to enhance reasoning.
  • The system achieves improved question answering and speech understanding with data efficiency, using just 6,000 hours of training data.

EchoX is a speech-to-speech LLM (SLLM) training framework purpose-built to address the degradation in reasoning and knowledge performance observed when adapting text-based LLMs to spoken modalities. Its core innovation is an "Echo training" paradigm that integrates acoustic and semantic learning, leveraging pseudo speech targets generated from semantic representations, thereby bridging the acoustic-semantic gap that plagues direct SLLM training. The EchoX architecture combines multi-stage training: speech-to-text learning, text-to-codec tokenization, and an auxiliary alignment stage, yielding LLMs that preserve strong reasoning and semantic abilities in the speech domain while maintaining data efficiency (Zhang et al., 11 Sep 2025).

1. Acoustic-Semantic Gap in SLLMs

A foundational challenge addressed by EchoX is the acoustic-semantic gap. In standard SLLM training, speech tokens, which encode fine acoustic details (such as exact pronunciation), are predicted using objectives that heavily penalize minor acoustic deviations—even if the underlying semantic content is correct. By contrast, text-based LLMs train on semantically meaningful tokens and thus develop powerful semantic understanding and reasoning capabilities. When text LLMs are adapted to SLLMs, this acoustics-focused supervision misaligns internal representations, causing degraded knowledge and reasoning in the resulting speech models. The core of the problem is that, in SLLMs, matching the acoustic form is required for “correct” output, whereas in text LLMs, minor variations in textual tokens do not undermine semantic fidelity.
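To make the mismatch concrete, the following minimal sketch (not from the paper; the codec vocabulary size and token IDs are invented) scores a codec-token sequence that realizes the same words with different acoustics against the annotated targets. The token-level cross-entropy is large even though the semantic content is identical:

```python
import torch
import torch.nn.functional as F

vocab_size = 1024  # hypothetical codec vocabulary
# Annotated speech tokens for an utterance, and an alternative rendition
# of the same words with different prosody/voice (all IDs are invented).
target = torch.tensor([17, 403, 88, 529, 246])
same_semantics = torch.tensor([17, 512, 91, 529, 777])

# Suppose the model is confident in the alternative rendition.
logits = torch.full((5, vocab_size), -10.0)
logits[torch.arange(5), same_semantics] = 10.0

loss = F.cross_entropy(logits, target)
print(f"CE vs. annotated tokens: {loss.item():.2f}")  # large, despite correct content
```

In text LLM training, by contrast, a semantically equivalent output typically differs in only a few tokens, so the objective and the semantics stay aligned.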

2. EchoX Training Paradigm

EchoX introduces a three-stage training pipeline explicitly designed to align semantic and acoustic features and mitigate the acoustic-semantic gap:

  1. Speech-to-Text (S2T) Training: An acoustic encoder transforms speech input into semantic representations, typically using adapter-based parameter-efficient fine-tuning methods (e.g., LoRA adapters) and LLM architectures like Soundwave. The objective is to maximize semantic comprehension, resulting in high-quality textual output from speech.
  2. Text-to-Codec (T2C) Training: A decoder-only network is trained to predict quantized speech tokens (codec units) from text. The T2C model transforms text into “unit language”—a highly compressed representation of the speech waveform—under a standard cross-entropy loss. Importantly, decoder embeddings are kept fixed to anchor the representation.
  3. Echo Training: The EchoX innovation. The semantic hidden state $H$ (from the S2T LLM) is used to greedily generate a textual sequence $X'$. This sequence is passed through a frozen T2C module, which outputs pseudo-speech token targets $Y'$. The Echo decoder (initialized from the T2C decoder) is then trained to generate $Y'$ given $H$; a hedged sketch of this target construction appears below. A feed-forward Denoising Adapter between $H$ and the Echo decoder further refines the embedding alignment. The loss terms are a cross-entropy Echo loss $L_\mathrm{Echo}$, a cosine-based denoising loss $L_\mathrm{Denoising}$, and the original S2T loss $L_\mathrm{S2T}$:

$$L = L_\mathrm{Echo} + \lambda \cdot L_\mathrm{Denoising} + L_\mathrm{S2T}$$

where $\lambda$ weights the denoising term.

Unlike prior approaches that rely solely on annotated speech tokens, EchoX's Echo decoder encourages the model to map semantic representations onto acoustically plausible, pseudo-labeled codec tokens, reducing the penalty for semantically valid acoustic variation and thus aligning internal representations for semantic reasoning.
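As a concrete rendering of the target-construction path referenced above, the snippet below assumes PyTorch-style wrappers exposing `greedy_decode` and `generate` methods; these interface names are assumptions for exposition, not the released EchoX API:

```python
import torch

@torch.no_grad()
def make_pseudo_targets(H, s2t_llm, t2c):
    """Sketch of Echo-training target construction (interfaces assumed).

    H       : semantic hidden states from the S2T stage, (batch, steps, dim)
    s2t_llm : assumed to expose greedy_decode(H) -> text token ids X'
    t2c     : frozen text-to-codec model, assumed to expose generate(X') -> Y'
    """
    x_prime = s2t_llm.greedy_decode(H)  # textual hypothesis X' decoded from H
    y_prime = t2c.generate(x_prime)     # pseudo speech-token targets Y'
    return x_prime, y_prime
```

Because the T2C module is frozen and invoked without gradient tracking, the pseudo-labeling path contributes no gradients; learning happens only in the Echo decoder, the denoising adapter, and the S2T backbone.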

3. Model Architecture and Loss Functions

The overall architecture consists of:

  • Acoustic Encoder: Processes raw speech into high-level features.
  • Semantic Backbone (LLM core): Receives the encoder output, integrates LoRA adapters for speech adaptation, and produces hidden states $H$.
  • Echo Decoder: A decoder-only module initialized from the T2C stage, responsible for generating speech tokens from $H$ during Echo training.
  • Denoising Adapter: A feed-forward module between $H$ and the Echo Decoder, trained to maximize the cosine similarity between the adapted state and the T2C embedding of the generated text.
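The wiring of these four components can be sketched as below; every submodule is a stand-in (a small Transformer encoder in place of the LoRA-adapted LLM, a linear head in place of the T2C-initialized Echo decoder), and all sizes are placeholders:

```python
import torch
import torch.nn as nn

class EchoXWiring(nn.Module):
    """Illustrative component wiring only; not the released architecture."""
    def __init__(self, d=512, codec_vocab=1024):
        super().__init__()
        self.acoustic_encoder = nn.Conv1d(80, d, kernel_size=3, padding=1)  # mel -> features
        self.backbone = nn.TransformerEncoder(  # stand-in for the LoRA-adapted LLM core
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        self.denoise_adapter = nn.Sequential(   # feed-forward adapter on hidden states H
            nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.echo_decoder = nn.Linear(d, codec_vocab)  # stand-in for the T2C-initialized decoder

    def forward(self, mel):                                  # mel: (batch, 80, frames)
        feats = self.acoustic_encoder(mel).transpose(1, 2)   # (batch, frames, d)
        H = self.backbone(feats)                             # semantic hidden states H
        return self.echo_decoder(self.denoise_adapter(H))    # logits over codec tokens

logits = EchoXWiring()(torch.randn(2, 80, 100))
print(logits.shape)  # torch.Size([2, 100, 1024])
```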

The system is supervised using:

| Loss | Formula | Role |
|------|---------|------|
| Echo loss ($L_\mathrm{Echo}$) | $-\sum_i \log P(y'_i \mid H, y'_{<i})$ | Guides the Echo decoder toward pseudo speech tokens |
| Denoising loss ($L_\mathrm{Denoising}$) | $\sum_i \big(1 - \mathrm{Cos}(\mathrm{Adapter}(H_i), \mathrm{Emb}(X'_i))\big)$ | Aligns $H$ with the T2C embeddings via the adapter |
| S2T loss ($L_\mathrm{S2T}$) | $-\sum_i \log P(x_i \mid H_S, x_{<i})$ | Maintains the S2T semantic mapping |

The final training objective is the weighted sum:

$$L = L_\mathrm{Echo} + \lambda \cdot L_\mathrm{Denoising} + L_\mathrm{S2T}$$

where $\lambda$ is tuned on validation performance.
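Assuming precomputed logits and embeddings, a hedged sketch of this objective could look like the following (tensor shapes are noted in the docstring; `lam=0.5` is an arbitrary placeholder, since the paper tunes $\lambda$ on validation data):

```python
import torch
import torch.nn.functional as F

def echox_objective(echo_logits, y_prime, adapted_H, t2c_emb,
                    s2t_logits, text_ids, lam=0.5):
    """Weighted EchoX objective (sketch).

    echo_logits : (B, Ty, V_codec)  Echo decoder logits over codec tokens
    y_prime     : (B, Ty)           pseudo speech-token targets Y'
    adapted_H   : (B, Tx, d)        Adapter(H), positions assumed aligned with X'
    t2c_emb     : (B, Tx, d)        T2C embeddings Emb(X')
    s2t_logits  : (B, Tt, V_text)   S2T head logits
    text_ids    : (B, Tt)           reference text tokens
    """
    echo_loss = F.cross_entropy(echo_logits.transpose(1, 2), y_prime)
    # Paper notation sums the per-position cosine terms; averaged here for scale.
    denoise_loss = (1 - F.cosine_similarity(adapted_H, t2c_emb, dim=-1)).mean()
    s2t_loss = F.cross_entropy(s2t_logits.transpose(1, 2), text_ids)
    return echo_loss + lam * denoise_loss + s2t_loss
```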

4. Experimental Setup and Empirical Results

EchoX employs approximately 6,000 hours of training data spanning three task families:

  • ASR data: LibriSpeech, MLS (automatic speech recognition tasks)
  • TTS data: AudioQA-1M, SpeechInstruct, HH-RLHF-Speech (text-to-speech tasks)
  • Spoken Question Answering: ShareChatX, Magpie-Pro-Speech+ (knowledge-based SQA)

Model variants explored include EchoX-3B (LLaMA 3.2 backbone) and EchoX-8B (LLaMA 3.1 backbone).

Results show:

  • On knowledge-based QA (Llama Questions, Web Questions, TriviaQA), EchoX-3B attains 54.0, 31.6, 25.8 (average 37.1); EchoX-8B averages 46.3.
  • On speech-to-text, EchoX demonstrates improvements in semantic comprehension compared to baselines.
  • EchoX achieves these results with orders-of-magnitude less data than alternative interleaved text–speech training methods (∼6k hours vs millions).
  • Streaming inference for long speech outputs does not significantly degrade performance, and the use of unit language for compact representation reduces error accumulation on long sequences.

5. Comparative Analysis and Implications

EchoX resolves the core acoustic-semantic misalignment by constructing an auxiliary path that maps semantic representations to plausible speech tokens without harshly penalizing acoustically different yet semantically correct realizations. This contrasts with previous models that require speech token predictions to match annotated targets exactly, which inevitably lowers reasoning and factual abilities by overemphasizing fidelity to acoustic form.

The method’s modular design allows additional refinements—e.g., further signal processing, more advanced denoising or streaming strategies, or improvements to the naturalness of codec-to-speech synthesis. The release of code and supplementary materials supports further benchmarking and community-driven extension.

A plausible implication is that EchoX’s training paradigm could serve as a template for other cross-modal LLM architectures where modality misalignment hampers knowledge transfer.

6. Summary and Future Research Directions

EchoX is a rigorously evaluated SLLM training system that addresses the acoustic-semantic gap inherent in mapping text-based LLMs to the spoken domain. Its three-stage protocol—combining S2T learning, T2C-based speech tokenization, and Echo training for semantic–acoustic alignment—yields strong empirical performance in knowledge-based question answering and speech-to-text understanding, while requiring modest training data relative to previous approaches.

Future research directions indicated include further study of advanced denoising approaches, optimized streaming generation for very long utterances, and continued refinement of the naturalness and variability of speech output generated via the codec model. The EchoX public repository (https://github.com/FreedomIntelligence/EchoX) provides resources for ongoing experimentation and comparative studies.


Table: EchoX Training Pipeline Overview

| Stage | Input | Model Component | Target Output | Loss Function |
|-------|-------|-----------------|---------------|---------------|
| S2T | Speech | Acoustic encoder + LLM | Text | $L_\mathrm{S2T}$ |
| T2C | Text | T2C decoder | Quantized speech tokens | Cross-entropy |
| Echo | S2T hidden states | Denoising adapter + Echo decoder | Pseudo speech tokens (via T2C) | $L_\mathrm{Echo}$, $L_\mathrm{Denoising}$ |

By aligning internal LLM representations across semantically rich and acoustically accurate subspaces, EchoX preserves reasoning in speech-based models at lower data and computation cost (Zhang et al., 11 Sep 2025).
