EchoX: Speech-to-Speech LLM Framework
- EchoX is a speech-to-speech large language model training framework that integrates acoustic and semantic learning using a novel Echo training paradigm.
- Its multi-stage pipeline—combining speech-to-text, text-to-codec, and Echo training—aligns semantic and acoustic representations to enhance reasoning.
- The system achieves improved question answering and speech understanding while remaining data-efficient, using only about 6,000 hours of training data.
EchoX is a speech-to-speech LLM (SLLM) training framework purpose-built to address the degradation in reasoning and knowledge performance observed when adapting text-based LLMs to spoken modalities. Its core innovation is an “Echo training” paradigm that integrates acoustic and semantic learning by leveraging pseudo speech targets generated from semantic representations, thereby bridging the acoustic-semantic gap that hampers direct SLLM training. The EchoX architecture combines multi-stage training (speech-to-text, text-to-speech tokenization, and an auxiliary alignment stage), yielding LLMs that preserve strong reasoning and semantic abilities in the speech domain while remaining data-efficient (Zhang et al., 11 Sep 2025).
1. Acoustic-Semantic Gap in SLLMs
A foundational challenge addressed by EchoX is the acoustic-semantic gap. In standard SLLM training, speech tokens, which encode fine acoustic details (such as exact pronunciation), are predicted using objectives that heavily penalize minor acoustic deviations—even if the underlying semantic content is correct. By contrast, text-based LLMs train on semantically meaningful tokens and thus develop powerful semantic understanding and reasoning capabilities. When text LLMs are adapted to SLLMs, this acoustics-focused supervision misaligns internal representations, causing degraded knowledge and reasoning in the resulting speech models. The core of the problem is that, in SLLMs, matching the acoustic form is required for “correct” output, whereas in text LLMs, minor variations in textual tokens do not undermine semantic fidelity.
2. EchoX Training Paradigm
EchoX introduces a three-stage training pipeline explicitly designed to align semantic and acoustic features and mitigate the acoustic-semantic gap:
- Speech-to-Text (S2T) Training: An acoustic encoder transforms speech input into semantic representations, typically using adapter-based parameter-efficient fine-tuning methods (e.g., LoRA adapters) and LLM architectures like Soundwave. The objective is to maximize semantic comprehension, resulting in high-quality textual output from speech.
- Text-to-Codec (T2C) Training: A decoder-only network is trained to predict quantized speech tokens (codec units) from text. The T2C model transforms text into “unit language”—a highly compressed representation of the speech waveform—under a standard cross-entropy loss. Importantly, decoder embeddings are kept fixed to anchor the representation.
- Echo Training: The EchoX innovation. The semantic hidden state $h$ (from the S2T LLM) is used to greedily generate a textual sequence $\hat{y}$. This sequence is passed through a frozen T2C module, which outputs pseudo-speech token targets $\hat{u}$. The Echo decoder (initialized from the T2C decoder) is then trained to generate $\hat{u}$ given $h$. A feed-forward Denoising Adapter between $h$ and the Echo decoder further refines the embedding alignment. Loss functions include cross-entropy (Echo loss $\mathcal{L}_{\text{echo}}$), a cosine-based denoising loss $\mathcal{L}_{\text{dn}}$, and the original S2T loss $\mathcal{L}_{\text{S2T}}$:

$$\mathcal{L} = \mathcal{L}_{\text{echo}} + \alpha\,\mathcal{L}_{\text{dn}} + \mathcal{L}_{\text{S2T}},$$

where $\alpha$ weights the denoising term.
Unlike prior approaches that rely solely on annotated speech tokens, EchoX’s Echo decoder encourages the model to map semantic representations onto acoustically plausible, pseudo-labeled codec tokens, reducing the penalty for semantically valid acoustic variation and thus keeping internal representations aligned for semantic reasoning.
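To make the Echo stage concrete, the following is a minimal PyTorch-style sketch of one training step under the description above. All module names and interfaces (`s2t_llm`, `t2c_decoder.generate`, `t2c_decoder.embed`, `echo_decoder`, `denoise_adapter`) are hypothetical stand-ins rather than the released EchoX code, and greedy decoding is abbreviated to a per-position argmax.

```python
import torch
import torch.nn.functional as F


def echo_training_step(s2t_llm, t2c_decoder, echo_decoder, denoise_adapter,
                       speech_features, text_targets, alpha=0.5):
    """One Echo-training step; every module interface here is an assumption."""
    # S2T path: the speech-adapted LLM returns text logits and the
    # semantic hidden states h (assumed return signature).
    text_logits, h = s2t_llm(speech_features)

    loss_s2t = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)), text_targets.reshape(-1))

    # Greedy text sequence y_hat (per-position argmax as a simplified stand-in
    # for autoregressive greedy decoding), then the frozen T2C module turns it
    # into pseudo speech-token targets u_hat.
    with torch.no_grad():
        y_hat = text_logits.argmax(dim=-1)
        u_hat = t2c_decoder.generate(y_hat)   # pseudo speech tokens (hypothetical API)
        t2c_emb = t2c_decoder.embed(y_hat)    # T2C embedding of the generated text

    # Denoising Adapter refines h toward the T2C embedding space
    # (assumes matching sequence lengths for simplicity).
    h_adapted = denoise_adapter(h)
    loss_dn = 1.0 - F.cosine_similarity(h_adapted, t2c_emb, dim=-1).mean()

    # Echo decoder is supervised with cross-entropy on the pseudo targets.
    echo_logits = echo_decoder(h_adapted, targets=u_hat)   # teacher forcing (hypothetical API)
    loss_echo = F.cross_entropy(
        echo_logits.reshape(-1, echo_logits.size(-1)), u_hat.reshape(-1))

    # Weighted objective: L = L_echo + alpha * L_dn + L_S2T.
    return loss_echo + alpha * loss_dn + loss_s2t
```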
3. Model Architecture and Loss Functions
The overall architecture consists of:
- Acoustic Encoder: Processes raw speech to high-level features.
- Semantic Backbone (LLM core): Receives the encoder output, integrates LoRA adapters for speech adaptation, and produces hidden states $h$.
- Echo Decoder: A decoder-only module initialized from the T2C stage, responsible for generating speech tokens from $h$ during Echo training.
- Denoising Adapter: A feed-forward module between $h$ and the Echo Decoder, trained to maximize the cosine similarity between the adapted state and the T2C embedding of the generated text.
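The Denoising Adapter can be illustrated with a small sketch. The layer sizes, activation, and exact cosine formulation below are assumptions for illustration; the description above specifies only a feed-forward module trained to maximize cosine similarity between the adapted hidden state and the T2C embedding of the generated text.

```python
import torch.nn as nn
import torch.nn.functional as F


class DenoisingAdapter(nn.Module):
    """Feed-forward adapter mapping S2T hidden states h toward the T2C
    embedding space (dimensions are illustrative, not from the paper)."""

    def __init__(self, hidden_dim=4096, inner_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, inner_dim),
            nn.GELU(),
            nn.Linear(inner_dim, hidden_dim),
        )

    def forward(self, h):
        return self.net(h)


def denoising_loss(h_adapted, t2c_embedding):
    # Maximizing cosine similarity is equivalent to minimizing
    # (1 - cosine similarity), averaged over sequence positions.
    return 1.0 - F.cosine_similarity(h_adapted, t2c_embedding, dim=-1).mean()
```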
The system is supervised using:
| Loss Name | Formula | Role |
|---|---|---|
| Echo loss $\mathcal{L}_{\text{echo}}$ | Cross-entropy over pseudo speech tokens $\hat{u}$ | Guides the Echo decoder toward the pseudo speech tokens |
| Denoising loss $\mathcal{L}_{\text{dn}}$ | $1 - \cos$ similarity between adapted $h$ and the T2C text embedding | Aligns $h$ with the T2C embedding via the adapter |
| S2T loss $\mathcal{L}_{\text{S2T}}$ | Cross-entropy over text tokens | Maintains the S2T semantic mapping |
The final training objective is the weighted sum

$$\mathcal{L} = \mathcal{L}_{\text{echo}} + \alpha\,\mathcal{L}_{\text{dn}} + \mathcal{L}_{\text{S2T}},$$

where $\alpha$ is tuned per validation performance.
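As a usage note, a hedged sketch of how this combined objective might drive an outer optimization loop, reusing the hypothetical `echo_training_step` from the earlier sketch; the optimizer choice, learning rate, `dataloader`, and the `lora_parameters()` accessor are illustrative assumptions, not details of the EchoX codebase.

```python
import torch

# Only the Echo decoder, Denoising Adapter, and LoRA adapters are assumed
# trainable here; the T2C decoder stays frozen as described above.
trainable = (list(echo_decoder.parameters())
             + list(denoise_adapter.parameters())
             + list(s2t_llm.lora_parameters()))   # hypothetical LoRA-parameter accessor
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

for speech_features, text_targets in dataloader:
    loss = echo_training_step(s2t_llm, t2c_decoder, echo_decoder, denoise_adapter,
                              speech_features, text_targets, alpha=0.5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```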
4. Experimental Setup and Empirical Results
EchoX employs approximately 6,000 hours of training data drawn from diverse modalities:
- ASR data: LibriSpeech, MLS (automatic speech recognition tasks)
- TTS data: AudioQA-1M, SpeechInstruct, HH-RLHF-Speech (text-to-speech tasks)
- Spoken Question Answering: ShareChatX, Magpie-Pro-Speech+ (knowledge-based SQA)
Model variants explored include EchoX-3B (LLaMA 3.2 backbone) and EchoX-8B (LLaMA 3.1 backbone).
Results show:
- On knowledge-based QA (Llama Questions, Web Questions, TriviaQA), EchoX-3B attains 54.0, 31.6, 25.8 (average 37.1); EchoX-8B averages 46.3.
- On speech-to-text, EchoX demonstrates improvements in semantic comprehension compared to baselines.
- EchoX achieves these results with orders-of-magnitude less data than alternative interleaved text–speech training methods (∼6k hours vs millions).
- Streaming inference for long speech outputs does not significantly degrade performance, and the use of unit language for compact representation reduces error accumulation on long sequences.
5. Comparative Analysis and Implications
EchoX resolves the core acoustic-semantic misalignment by constructing an auxiliary path for semantic representations to be mapped to plausible speech tokens, but without harshly penalizing acoustically different—yet semantically correct—realizations. This contrasts with previous models that require speech token prediction to match annotated targets exactly, inevitably lowering reasoning and factual abilities due to overemphasis on fidelity to acoustic form.
The method’s modular design allows additional refinements—e.g., further signal processing, more advanced denoising or streaming strategies, or improvements to the naturalness of codec-to-speech synthesis. The release of code and supplementary materials supports further benchmarking and community-driven extension.
A plausible implication is that EchoX’s training paradigm could serve as a template for other cross-modal LLM architectures where modality misalignment hampers knowledge transfer.
6. Summary and Future Research Directions
EchoX is a rigorously evaluated SLLM training system that addresses the acoustic-semantic gap inherent in mapping text-based LLMs to the spoken domain. Its three-stage protocol—combining S2T learning, T2C-based speech tokenization, and Echo training for semantic–acoustic alignment—yields strong empirical performance in knowledge-based question answering and speech-to-text understanding, while requiring modest training data relative to previous approaches.
Future research directions indicated include further study of advanced denoising approaches, optimized streaming generation for very long utterances, and continued refinement of the naturalness and variability of speech output generated via the codec model. The EchoX public repository (https://github.com/FreedomIntelligence/EchoX) provides resources for ongoing experimentation and comparative studies.
Table: EchoX Training Pipeline Overview
| Stage | Input | Model Component | Target Output | Loss Function |
|---|---|---|---|---|
| S2T | Speech | Acoustic Encoder + LLM | Text | $\mathcal{L}_{\text{S2T}}$ (cross-entropy) |
| T2C | Text | T2C Decoder | Quantized speech tokens | Cross-entropy |
| Echo | S2T hidden state $h$ | Denoising Adapter + Echo Decoder | Pseudo speech tokens (via T2C) | $\mathcal{L}_{\text{echo}}$, $\mathcal{L}_{\text{dn}}$ |
By aligning internal LLM representations across semantically rich and acoustically accurate subspaces, EchoX preserves reasoning ability in speech-based models at lower data and computation cost (Zhang et al., 11 Sep 2025).