
XS-CoT: Cross-lingual Speech Chain-of-Thought

Updated 8 December 2025
  • XS-CoT is a modular framework that integrates cross-lingual chain-of-thought reasoning and token compression to enhance instruction-following in low-resource languages.
  • It employs a structured pipeline that transcribes non-core language speech, translates to core language for reasoning, and re-translates the response back to the target language.
  • Empirical results show up to a 50.5% reduction in inference latency with only a modest performance trade-off, alongside substantial gains in instruction-following quality.

The Cross-lingual Speech Chain-of-Thought (XS-CoT) framework is an approach for enhancing non-core-language instruction-following in Speech LLMs (SLLMs) by integrating cross-lingual chain-of-thought (CoT) reasoning with token compression. XS-CoT leverages a structured pipeline of speech transcription, cross-lingual translation, core-language reasoning, and targeted token compression to address the scarcity of high-quality non-core-language speech-text data and the limited multilingual reasoning capabilities of existing SLLMs. Empirical results demonstrate substantial improvements in instruction-following quality and inference latency, particularly in low-resource (non-core) languages, by exploiting the robust reasoning capacity of core-language LLMs (Xue et al., 29 Apr 2025).

1. System Architecture

XS-CoT is built on a modular stack consisting of a speech encoder, modal adapter, and LLM. The processing pipeline operates as follows:

  • Input: Speech in a non-core (target) language, denoted n (e.g., Japanese, German, French).
  • Stage 1 (Modal Alignment): The speech encoder and adapter transcribe the speech into a text instruction x^{\rm instr}_n in the same language.
  • Stage 2 (XS-CoT Fine-tuning): The pipeline translates x^{\rm instr}_n into a core-language (English) instruction x^{\rm instr}_c, performs chain-of-thought reasoning in English to produce x^{\rm resp}_c, then translates the English response back into the target language as x^{\rm resp}_n.
  • Stage 3 (Semi-Implicit CoT Compression): The core-language reasoning chain is progressively compressed during training, so that only a compact sketch of the reasoning is generated at inference time, reducing latency while retaining the global reasoning logic.

The pipeline is formalized as follows:

\text{speech}_n \rightarrow x^{\rm instr}_n \rightarrow x^{\rm instr}_c \rightarrow x^{\rm resp}_c \rightarrow x^{\rm resp}_n

where each transformation involves explicit sequence generation.
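The staged pipeline above can be sketched as plain function composition. This is a hedged illustration: the four stage functions stand in for the SLLM's decoding passes, and their names are illustrative rather than taken from the source.

```python
# Sketch of the XS-CoT pipeline stages as function composition.
# transcribe, translate_to_core, reason_in_core, and translate_to_target
# are placeholders for the model's autoregressive decoding phases.

def xs_cot_pipeline(speech_n, transcribe, translate_to_core,
                    reason_in_core, translate_to_target):
    x_instr_n = transcribe(speech_n)           # speech_n -> x^instr_n
    x_instr_c = translate_to_core(x_instr_n)   # x^instr_n -> x^instr_c
    x_resp_c = reason_in_core(x_instr_c)       # CoT reasoning in English
    x_resp_n = translate_to_target(x_resp_c)   # x^resp_c -> x^resp_n
    return [x_instr_n, x_instr_c, x_resp_c, x_resp_n]
```

In a real SLLM all four transformations are produced by one autoregressive decoder; the decomposition here only mirrors the explicit sequence-generation order.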

2. Token Typology and Cross-Lingual Transfer

XS-CoT introduces four explicit token types to facilitate cross-lingual reasoning transfer:

| Token Type | Notation | Function |
| --- | --- | --- |
| Target-language instruction | x^{\rm instr}_n | Aligns speech-encoder output to a textual instruction in the target language |
| Core-language instruction | x^{\rm instr}_c | Enables leveraging the LLM's reasoning strength in the core language (English) |
| Core-language response | x^{\rm resp}_c | Encodes the English chain-of-thought reasoning and provisional answer |
| Target-language response | x^{\rm resp}_n | Translates the English reasoning and answer into the target language |

The full output sequence is

x = [x^{\rm instr}_n,\, x^{\rm instr}_c,\, x^{\rm resp}_c,\, x^{\rm resp}_n]

By sandwiching the English reasoning within bidirectional translations, XS-CoT transfers core-language reasoning to low-resource domains.
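The concatenation above can be sketched as a simple sequence-assembly helper. The `<sep>` delimiter is a hypothetical choice for readability; the source does not specify how the four segments are delimited.

```python
# Illustrative assembly of the four-part output sequence
# x = [x^instr_n, x^instr_c, x^resp_c, x^resp_n].
# The <sep> delimiter is an assumed placeholder, not from the paper.

def build_target_sequence(x_instr_n, x_instr_c, x_resp_c, x_resp_n,
                          sep="<sep>"):
    """Concatenate the four token segments into one training target."""
    return sep.join([x_instr_n, x_instr_c, x_resp_c, x_resp_n])
```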

3. Semi-Implicit Chain-of-Thought Compression

To address the significant inference latency incurred by long core-language CoT chains (typically exceeding 100 tokens), XS-CoT employs a semi-implicit compression mechanism:

  • The English CoT response R = [s_1, s_2, \ldots, s_x] is partitioned into x sentences.
  • Each sentence s_i is further divided into word-groups; only the first k groups are retained, followed by an ellipsis to indicate omission.
  • At training epoch n, the first m(n) sentences are compressed, where

m(n) = \begin{cases} \min(x, n) & n < m_0 \\ x & n \ge m_0 \end{cases}

  • The compression operator c_k applied to a sentence s is defined as

c_k(s) = [w_1, \ldots, w_k, \ldots] \quad \text{(where the } w_j \text{ are word-groups and the trailing ellipsis marks the omitted groups)}

  • The output at epoch n becomes

c^{(n)}(R) = [c_k(s_1), \ldots, c_k(s_{m(n)}), s_{m(n)+1}, \ldots, s_x]

  • The training objective remains the standard next-token log-likelihood across all four token types:

\mathcal{L}(\theta) = -\sum_{t=1}^{|x|} \log p_\theta\bigl(x_t \mid x_{<t},\, \text{speech}\bigr)

No reconstruction loss is required; the progressive compression guides the model to infer missing reasoning details.
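A minimal sketch of this compression schedule, assuming whitespace-delimited words serve as the word-groups (the source's exact grouping and the hyperparameters k and m_0 are not specified here):

```python
# Sketch of semi-implicit CoT compression: c_k truncates a sentence to its
# first k word-groups, m_of_n implements the epoch schedule m(n), and
# compress_chain applies c_k to the first m(n) sentences of the chain R.

def c_k(sentence, k):
    """Keep the first k word-groups of a sentence, then an ellipsis."""
    groups = sentence.split()
    if len(groups) <= k:
        return sentence
    return " ".join(groups[:k]) + " ..."

def m_of_n(n, num_sentences, m0):
    """Number of sentences compressed at epoch n: min(x, n) before m0, x after."""
    return min(num_sentences, n) if n < m0 else num_sentences

def compress_chain(sentences, n, k, m0):
    """c^(n)(R): compress the first m(n) sentences, keep the rest verbatim."""
    m = m_of_n(n, len(sentences), m0)
    return [c_k(s, k) for s in sentences[:m]] + list(sentences[m:])
```

As n grows, more of the chain collapses to its k-group sketch, which is what lets the model learn to carry the omitted reasoning implicitly without any reconstruction loss.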

4. Inference Workflow and Latency Analysis

During inference, XS-CoT proceeds in four explicit decoding phases:

  1. Decode x^{\rm instr}_n from speech.
  2. Decode x^{\rm instr}_c, conditioned on x^{\rm instr}_n.
  3. Decode a compressed CoT chain c^{(\infty)}(x^{\rm resp}_c), conditioned on previous outputs, where all sentences are fully compressed.
  4. Decode x^{\rm resp}_n from the compressed CoT representation.

Because the majority of intermediate English tokens are compressed—each sentence retaining only k word-groups—the delay before producing target-language responses is significantly reduced. Empirical measurements indicate that for a full chain of D_{\rm full} \approx 107 tokens, semi-implicit compression yields D_{\rm semi} \approx 53 tokens, a delay reduction of approximately 50.5%. The compression ratio is defined as

\rho = \frac{D_{\rm semi}}{D_{\rm full}}, \quad \Delta = 1 - \rho

where \Delta \approx 0.505 quantifies the delay reduction.
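A quick check of the quoted figures:

```python
# Worked check of the compression ratio and delay reduction quoted above.
D_full, D_semi = 107, 53
rho = D_semi / D_full   # compression ratio
delta = 1 - rho         # delay reduction
print(f"rho = {rho:.3f}, delta = {delta:.1%}")  # rho = 0.495, delta = 50.5%
```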

5. Data Pipeline and Resource Utilization

XS-CoT employs the “Multilingual Alpaca Speech” dataset, constructed for English (60K samples), Japanese (30K), French (10K), and German (10K). Each training sample is generated via:

  1. Extraction from Stanford Alpaca text instructions.
  2. Filtering for noise and quality.
  3. Translation to the target language.
  4. Synthetic speech generation using fish-speech TTS.
  5. Whisper ASR filtering (word error rate < 5%).

Crucially, the XS-CoT approach requires only a modest number of non-core-language instruction-response speech examples. By leveraging core-language reasoning, the framework transfers chain-of-thought capability efficiently, achieving 2–3× higher sample efficiency than direct supervised fine-tuning.
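Step 5 of the pipeline above can be illustrated with a word-level WER gate. The Levenshtein-based WER below is the standard formulation, and the 5% threshold follows the pipeline description; the ASR call itself (Whisper) is omitted, so the transcript is passed in as a string.

```python
# Quality gate for synthetic speech (step 5 above): keep a sample only if
# the ASR transcript matches the reference text with word error rate < 5%.
# word_error_rate is a plain word-level Levenshtein edit distance.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # DP table for word-level edit distance (substitutions, insertions, deletions)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def passes_asr_filter(reference, transcript, threshold=0.05):
    """True if the transcript's WER against the reference is below threshold."""
    return word_error_rate(reference, transcript) < threshold
```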

6. Empirical Results and Trade-offs

XS-CoT delivers marked improvements in instruction-following quality for non-core languages, as measured by GPT-4 evaluation scores on Japanese test sets (OpenHermes & ALPACA):

| Model | Direct SFT | XS-CoT | Absolute Gain | Relative Gain | Tokens (full) | Tokens (semi) |
| --- | --- | --- | --- | --- | --- | --- |
| SALMONN-JA | 28.4 | 50.3 | +21.9 | +77% | 107 | 53 |

For SLLMs averaging over SALMONN and Qwen2Audio, XS-CoT achieves a 45% relative improvement over direct supervised fine-tuning. A comparative trade-off between compression and performance is observed:

| Method | GPT-4 score | CoT tokens | Delay reduction |
| --- | --- | --- | --- |
| Full XS-CoT | 50.3 | 107 | — |
| Semi-Implicit XS-CoT | 43.0 | 53 | 50.5% |

Adopting the semi-implicit scheme saves roughly half the reasoning-token latency with only a 14.5% drop in GPT-4 score, representing a competitive balance between speed and answer quality (Xue et al., 29 Apr 2025).

References (1)
