XS-CoT: Cross-lingual Speech Chain-of-Thought
- XS-CoT is a modular framework that integrates cross-lingual chain-of-thought reasoning and token compression to enhance instruction-following in low-resource languages.
- It employs a structured pipeline that transcribes non-core language speech, translates to core language for reasoning, and re-translates the response back to the target language.
- Empirical results show up to a 50.5% reduction in reasoning-token delay at a modest cost in response quality, while substantially improving instruction-following over direct supervised fine-tuning.
The semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework is an approach for enhancing non-core language instruction-following in Speech LLMs (SLLMs) by integrating cross-lingual chain-of-thought (CoT) reasoning and token compression. XS-CoT leverages a structured pipeline involving speech-to-text translation, core-language reasoning, and targeted token compression to address the scarcity of high-quality non-core language speech-text data and the limited multilingual reasoning capabilities characteristic of existing SLLMs. Empirical results demonstrate substantial improvements in instruction-following quality and inference latency, particularly in low-resource (non-core) languages, by exploiting the robust reasoning capacity of core-LLMs (Xue et al., 29 Apr 2025).
1. System Architecture
XS-CoT is built on a modular stack consisting of a speech encoder, modal adapter, and LLM. The processing pipeline operates as follows:
- Input: Speech in a non-core (target) language, denoted $x_{\text{speech}}$ (e.g., Japanese, German, French).
- Stage 1 (Modal Alignment): The speech encoder and adapter transcribe speech into a text instruction $I_{\text{tgt}}$ in the same language.
- Stage 2 (XS-CoT Fine-tuning): The pipeline translates $I_{\text{tgt}}$ into a core-language (English) instruction $I_{\text{core}}$, performs chain-of-thought reasoning in English to produce $R_{\text{core}}$, then translates the English response back into the target language as $R_{\text{tgt}}$.
- Stage 3 (Semi-Implicit CoT Compression): During training, the core-language CoT tokens are progressively compressed, so that only a compact sketch of the reasoning chain is produced at inference time, reducing latency while retaining the global reasoning logic.
The pipeline is formalized as
$$x_{\text{speech}} \;\rightarrow\; I_{\text{tgt}} \;\rightarrow\; I_{\text{core}} \;\rightarrow\; R_{\text{core}} \;\rightarrow\; R_{\text{tgt}},$$
where each transformation involves explicit sequence generation.
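As a rough illustration, the staged pipeline can be sketched as composed generation calls. All function names here are hypothetical stand-ins for the model's decoding stages; each returns a string placeholder rather than real model output.

```python
# Illustrative sketch of the XS-CoT staged pipeline.
# Each stage stands in for one sequence-generation pass of the SLLM.

def transcribe(speech):
    """Stage 1: speech encoder + adapter -> target-language instruction I_tgt."""
    return f"I_tgt({speech})"

def translate_to_core(i_tgt):
    """Stage 2a: target-language instruction -> English instruction I_core."""
    return f"I_core({i_tgt})"

def reason_in_core(i_core):
    """Stage 2b: English chain-of-thought reasoning -> English response R_core."""
    return f"R_core({i_core})"

def translate_to_target(r_core):
    """Stage 2c: English response -> target-language response R_tgt."""
    return f"R_tgt({r_core})"

def xs_cot(speech):
    """Compose the four stages in pipeline order."""
    return translate_to_target(reason_in_core(translate_to_core(transcribe(speech))))
```

The nesting mirrors the conditioning structure: every later token type is generated conditioned on all earlier ones.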
2. Token Typology and Cross-Lingual Transfer
XS-CoT introduces four explicit token types to facilitate cross-lingual reasoning transfer:
| Token Type | Notation | Function |
|---|---|---|
| Target-language Instruction | $I_{\text{tgt}}$ | Aligns speech-encoder output to a textual instruction in the target language |
| Core-language Instruction | $I_{\text{core}}$ | Enables leveraging the LLM's reasoning strength in the core language (English) |
| Core-language Response | $R_{\text{core}}$ | Encodes the English chain-of-thought reasoning and provisional answer |
| Target-language Response | $R_{\text{tgt}}$ | Translates the English reasoning and answer to the target language |
The full output sequence is
$$y = [\,I_{\text{tgt}};\; I_{\text{core}};\; R_{\text{core}};\; R_{\text{tgt}}\,].$$
By sandwiching the English reasoning within bidirectional translations, XS-CoT transfers core-language reasoning to low-resource domains.
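Concretely, the training target is a single token stream formed by concatenating the four token types in fixed order. The separator token below is an illustrative placeholder, not the paper's actual special token.

```python
def build_training_target(i_tgt, i_core, r_core, r_tgt, sep="<sep>"):
    """Concatenate the four token types in pipeline order.
    The English reasoning (r_core) sits between the two translation
    directions, 'sandwiched' by bidirectional cross-lingual transfer.
    `sep` is a hypothetical separator token."""
    return sep.join([i_tgt, i_core, r_core, r_tgt])
```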
3. Semi-Implicit Chain-of-Thought Compression
To address the significant inference latency incurred by long core-language CoT chains (typically exceeding 100 tokens), XS-CoT employs a semi-implicit compression mechanism:
- The English CoT response $R_{\text{core}}$ is partitioned into $N$ sentences $s_1, \dots, s_N$.
- Each sentence is further divided into word-groups; only the first $k$ groups are retained, followed by an ellipsis to indicate omission.
- At training epoch $e$, the first $n_e$ sentences are compressed, where $n_e$ increases monotonically with $e$ until all $N$ sentences are compressed.
- The compression operator applied to a sentence $s_i$ is defined as
$$C(s_i) = [\,g_{i,1}, \dots, g_{i,k}, \langle\text{...}\rangle\,],$$
where $g_{i,j}$ denotes the $j$-th word-group of $s_i$.
- The output at epoch $e$ becomes
$$R_{\text{core}}^{(e)} = [\,C(s_1), \dots, C(s_{n_e}),\; s_{n_e+1}, \dots, s_N\,].$$
- The training objective remains the standard next-token log-likelihood across all four token types:
$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t},\, x_{\text{speech}}\right).$$
No reconstruction loss is required; the progressive compression guides the model to infer missing reasoning details.
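The compression mechanism can be sketched as follows. The values of `k`, `group_size`, and the linear epoch schedule are illustrative assumptions, not the paper's reported hyperparameters.

```python
import math

ELLIPSIS = "..."

def compress_sentence(sentence, k=2, group_size=3):
    """Retain only the first k word-groups of `group_size` words each,
    followed by an ellipsis marking the omitted reasoning.
    (k and group_size are illustrative, not the paper's values.)"""
    words = sentence.split()
    if len(words) <= k * group_size:
        return sentence  # nothing to omit
    return " ".join(words[: k * group_size]) + " " + ELLIPSIS

def semi_implicit_compress(sentences, epoch, total_epochs, k=2, group_size=3):
    """Progressive schedule: at epoch e, compress the first n_e sentences,
    with n_e growing linearly until every sentence is compressed by the
    final epoch (the exact schedule is an assumption)."""
    n_e = math.ceil(len(sentences) * min(1.0, epoch / total_epochs))
    return [compress_sentence(s, k, group_size) if i < n_e else s
            for i, s in enumerate(sentences)]
```

Because compression is applied to the *targets* during training, the model gradually learns to produce the compacted chain directly; no extra reconstruction objective is introduced.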
4. Inference Workflow and Latency Analysis
During inference, XS-CoT proceeds in four explicit decoding phases:
- Decode $I_{\text{tgt}}$ from speech.
- Decode $I_{\text{core}}$, conditioned on $I_{\text{tgt}}$.
- Decode a compressed CoT chain $\hat{R}_{\text{core}}$, conditioned on previous outputs, where all $N$ sentences are fully compressed.
- Decode $R_{\text{tgt}}$ from the compressed CoT representation.
Because the majority of intermediate English tokens are compressed (each sentence retains only its first $k$ word-groups), the delay before producing target-language responses is significantly reduced. Empirical measurements indicate that a full chain of roughly 107 CoT tokens compresses to about 53 tokens, a token delay reduction of approximately 50.5%. The compression ratio is defined as
$$\rho = 1 - \frac{|\hat{R}_{\text{core}}|}{|R_{\text{core}}|},$$
where $\rho$ quantifies the delay reduction.
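Plugging in the token counts reported in Section 6 confirms the stated figure:

```python
def delay_reduction(full_tokens, compressed_tokens):
    """rho = 1 - |compressed CoT| / |full CoT|: the fraction of
    reasoning-token delay removed by semi-implicit compression."""
    return 1.0 - compressed_tokens / full_tokens

# Token counts from the reported SALMONN-JA run (full vs. semi-implicit CoT):
rho = delay_reduction(107, 53)
print(round(rho, 3))  # 0.505, i.e. the ~50.5% delay reduction
```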
5. Data Pipeline and Resource Utilization
XS-CoT employs the “Multilingual Alpaca Speech” dataset, constructed for English (60K samples), Japanese (30K), French (10K), and German (10K). Each training sample is generated via:
- Extraction from Stanford Alpaca text instructions.
- Filtering for noise and quality.
- Translation to the target language.
- Synthetic speech generation using fish-speech TTS.
- Whisper ASR filtering (word error rate < 5%).
Crucially, the XS-CoT approach requires only a modest number of non-core language instruction-response speech examples. By leveraging core-language reasoning, the framework transfers chain-of-thought capability efficiently, achieving 2–3x higher sample efficiency over direct supervised fine-tuning.
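The final ASR-based filtering step can be sketched with a plain word-level Levenshtein WER. This is the standard WER definition; the paper's exact filtering tooling and normalization are not specified here, so treat this as an assumption-laden sketch.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length
    (standard WER; the dataset's exact tooling may differ)."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def passes_asr_filter(reference_text, asr_transcript, threshold=0.05):
    """Keep a synthetic-speech sample only if round-trip ASR WER < 5%."""
    return word_error_rate(reference_text, asr_transcript) < threshold
```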
6. Empirical Results and Trade-offs
XS-CoT delivers marked improvements in instruction-following quality for non-core languages, as measured by GPT-4 evaluation scores on Japanese test sets (OpenHermes & ALPACA):
| Model | Direct SFT | XS-CoT | Absolute Gain | Relative Gain | Tokens (full) | Tokens (semi) |
|---|---|---|---|---|---|---|
| SALMONN-JA | 28.4 | 50.3 | +21.9 | +77% | 107 | 53 |
For SLLMs averaging over SALMONN and Qwen2Audio, XS-CoT achieves a 45% relative improvement over direct supervised fine-tuning. A comparative trade-off between compression and performance is observed:
| Method | GPT-4 score | CoT tokens | Delay reduction |
|---|---|---|---|
| Full XS-CoT | 50.3 | 107 | – |
| Semi-Implicit XS-CoT | 43.0 | 53 | 50.5% |
Adopting the semi-implicit scheme saves roughly half the reasoning-token latency with only a 14.5% drop in GPT-4 score, representing a competitive balance between speed and answer quality (Xue et al., 29 Apr 2025).
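The percentages above follow directly from the tabulated scores:

```python
# Reproducing the reported trade-off figures from the tables above.
full_score, semi_score, sft_score = 50.3, 43.0, 28.4

relative_drop = (full_score - semi_score) / full_score  # quality cost of compression
relative_gain = (full_score - sft_score) / sft_score    # full XS-CoT vs. direct SFT

print(round(relative_drop, 3), round(relative_gain, 2))  # 0.145 0.77
```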