XS-CoT: Cross-lingual Speech Chain-of-Thought
- XS-CoT is a modular framework that integrates cross-lingual chain-of-thought reasoning and token compression to enhance instruction-following in low-resource languages.
- It employs a structured pipeline that transcribes non-core language speech, translates to core language for reasoning, and re-translates the response back to the target language.
- Empirical results show up to a 50.5% reduction in reasoning-token delay at a modest cost in response quality, while substantially improving instruction-following over direct supervised fine-tuning.
The semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework is an approach for enhancing non-core language instruction-following in Speech LLMs (SLLMs) by integrating cross-lingual chain-of-thought (CoT) reasoning and token compression. XS-CoT leverages a structured pipeline involving speech-to-text translation, core-language reasoning, and targeted token compression to address the scarcity of high-quality non-core language speech-text data and the limited multilingual reasoning capabilities characteristic of existing SLLMs. Empirical results demonstrate substantial improvements in instruction-following quality and inference latency, particularly in low-resource (non-core) languages, by exploiting the robust reasoning capacity of core-LLMs (Xue et al., 29 Apr 2025).
1. System Architecture
XS-CoT is built on a modular stack consisting of a speech encoder, modal adapter, and LLM. The processing pipeline operates as follows:
- Input: Speech in a non-core (target) language, denoted $x_{\text{speech}}$ (e.g., Japanese, German, French).
- Stage 1 (Modal Alignment): The speech encoder and adapter transcribe speech into a text instruction $I_{\text{tgt}}$ in the same language.
- Stage 2 (XS-CoT Fine-tuning): The pipeline translates $I_{\text{tgt}}$ into a core-language (English) instruction $I_{\text{core}}$, performs chain-of-thought reasoning in English to produce $R_{\text{core}}$, then translates the English response back into the target language as $R_{\text{tgt}}$.
- Stage 3 (Semi-Implicit CoT Compression): During training, the core-language CoT tokens are progressively compressed, so that only a compact sketch of the reasoning chain is produced at inference time, reducing latency while retaining the global reasoning logic.
The pipeline is formalized as
$$x_{\text{speech}} \;\rightarrow\; I_{\text{tgt}} \;\rightarrow\; I_{\text{core}} \;\rightarrow\; R_{\text{core}} \;\rightarrow\; R_{\text{tgt}},$$
where each transformation involves explicit sequence generation.
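As a rough illustration, the staged pipeline can be sketched as composed generation calls. All function names here are hypothetical stand-ins for the model's decoding stages; each returns a string placeholder rather than real model output.

```python
# Illustrative sketch of the XS-CoT staged pipeline.
# Each stage stands in for one sequence-generation pass of the SLLM.

def transcribe(speech):
    """Stage 1: speech encoder + adapter -> target-language instruction I_tgt."""
    return f"I_tgt({speech})"

def translate_to_core(i_tgt):
    """Stage 2a: target-language instruction -> English instruction I_core."""
    return f"I_core({i_tgt})"

def reason_in_core(i_core):
    """Stage 2b: English chain-of-thought reasoning -> English response R_core."""
    return f"R_core({i_core})"

def translate_to_target(r_core):
    """Stage 2c: English response -> target-language response R_tgt."""
    return f"R_tgt({r_core})"

def xs_cot(speech):
    """Compose the four stages in pipeline order."""
    return translate_to_target(reason_in_core(translate_to_core(transcribe(speech))))
```

The nesting mirrors the conditioning structure: every later token type is generated conditioned on all earlier ones.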
2. Token Typology and Cross-Lingual Transfer
XS-CoT introduces four explicit token types to facilitate cross-lingual reasoning transfer:
| Token Type | Notation | Function |
|---|---|---|
| Target-language Instruction | $I_{\text{tgt}}$ | Aligns speech-encoder output to a textual instruction in the target language |
| Core-language Instruction | $I_{\text{core}}$ | Enables leveraging the LLM's reasoning strength in the core language (English) |
| Core-language Response | $R_{\text{core}}$ | Encodes the English chain-of-thought reasoning and provisional answer |
| Target-language Response | $R_{\text{tgt}}$ | Translates the English reasoning and answer to the target language |
The full output sequence is
$$y = [\,I_{\text{tgt}};\; I_{\text{core}};\; R_{\text{core}};\; R_{\text{tgt}}\,].$$
By sandwiching the English reasoning within bidirectional translations, XS-CoT transfers core-language reasoning to low-resource domains.
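Concretely, the training target is a single token stream formed by concatenating the four token types in fixed order. The separator token below is an illustrative placeholder, not the paper's actual special token.

```python
def build_training_target(i_tgt, i_core, r_core, r_tgt, sep="<sep>"):
    """Concatenate the four token types in pipeline order.
    The English reasoning (r_core) sits between the two translation
    directions, 'sandwiched' by bidirectional cross-lingual transfer.
    `sep` is a hypothetical separator token."""
    return sep.join([i_tgt, i_core, r_core, r_tgt])
```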
3. Semi-Implicit Chain-of-Thought Compression
To address the significant inference latency incurred by long core-language CoT chains (typically exceeding 100 tokens), XS-CoT employs a semi-implicit compression mechanism:
- The English CoT response $R_{\text{core}}$ is partitioned into $N$ sentences $s_1, \dots, s_N$.
- Each sentence is further divided into word-groups; only the first $k$ groups are retained, followed by an ellipsis to indicate omission.
- At training epoch $e$, the first $n_e$ sentences are compressed, where $n_e$ increases monotonically with $e$ until all $N$ sentences are compressed.
- The compression operator applied to a sentence $s_i$ is defined as
$$C(s_i) = [\,g_{i,1}, \dots, g_{i,k}, \langle\text{...}\rangle\,],$$
where $g_{i,j}$ denotes the $j$-th word-group of $s_i$.
- The output at epoch $e$ becomes
$$R_{\text{core}}^{(e)} = [\,C(s_1), \dots, C(s_{n_e}),\; s_{n_e+1}, \dots, s_N\,].$$
- The training objective remains the standard next-token log-likelihood across all four token types:
$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t},\, x_{\text{speech}}\right).$$
No reconstruction loss is required; the progressive compression guides the model to infer missing reasoning details.
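The compression mechanism can be sketched as follows. The values of `k`, `group_size`, and the linear epoch schedule are illustrative assumptions, not the paper's reported hyperparameters.

```python
import math

ELLIPSIS = "..."

def compress_sentence(sentence, k=2, group_size=3):
    """Retain only the first k word-groups of `group_size` words each,
    followed by an ellipsis marking the omitted reasoning.
    (k and group_size are illustrative, not the paper's values.)"""
    words = sentence.split()
    if len(words) <= k * group_size:
        return sentence  # nothing to omit
    return " ".join(words[: k * group_size]) + " " + ELLIPSIS

def semi_implicit_compress(sentences, epoch, total_epochs, k=2, group_size=3):
    """Progressive schedule: at epoch e, compress the first n_e sentences,
    with n_e growing linearly until every sentence is compressed by the
    final epoch (the exact schedule is an assumption)."""
    n_e = math.ceil(len(sentences) * min(1.0, epoch / total_epochs))
    return [compress_sentence(s, k, group_size) if i < n_e else s
            for i, s in enumerate(sentences)]
```

Because compression is applied to the *targets* during training, the model gradually learns to produce the compacted chain directly; no extra reconstruction objective is introduced.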
4. Inference Workflow and Latency Analysis
During inference, XS-CoT proceeds in four explicit decoding phases:
- Decode $I_{\text{tgt}}$ from speech.
- Decode $I_{\text{core}}$, conditioned on $I_{\text{tgt}}$.
- Decode a compressed CoT chain $\hat{R}_{\text{core}}$, conditioned on previous outputs, where all $N$ sentences are fully compressed.
- Decode $R_{\text{tgt}}$ from the compressed CoT representation.
Because the majority of intermediate English tokens are compressed (each sentence retains only its first $k$ word-groups), the delay before producing target-language responses is significantly reduced. Empirical measurements indicate that a full chain of roughly 107 CoT tokens compresses to about 53 tokens, a token delay reduction of approximately 50.5%. The compression ratio is defined as
$$\rho = 1 - \frac{|\hat{R}_{\text{core}}|}{|R_{\text{core}}|},$$
where $\rho$ quantifies the delay reduction.
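Plugging in the token counts reported in Section 6 confirms the stated figure:

```python
def delay_reduction(full_tokens, compressed_tokens):
    """rho = 1 - |compressed CoT| / |full CoT|: the fraction of
    reasoning-token delay removed by semi-implicit compression."""
    return 1.0 - compressed_tokens / full_tokens

# Token counts from the reported SALMONN-JA run (full vs. semi-implicit CoT):
rho = delay_reduction(107, 53)
print(round(rho, 3))  # 0.505, i.e. the ~50.5% delay reduction
```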
5. Data Pipeline and Resource Utilization
XS-CoT employs the “Multilingual Alpaca Speech” dataset, constructed for English (60K samples), Japanese (30K), French (10K), and German (10K). Each training sample is generated via:
- Extraction from Stanford Alpaca text instructions.
- Filtering for noise and quality.
- Translation to the target language.
- Synthetic speech generation using fish-speech TTS.
- Whisper ASR filtering (word error rate < 5%).
Crucially, the XS-CoT approach requires only a modest number of non-core language instruction-response speech examples. By leveraging core-language reasoning, the framework transfers chain-of-thought capability efficiently, achieving 2–3x higher sample efficiency over direct supervised fine-tuning.
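The final ASR-based filtering step can be sketched with a plain word-level Levenshtein WER. This is the standard WER definition; the paper's exact filtering tooling and normalization are not specified here, so treat this as an assumption-laden sketch.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length
    (standard WER; the dataset's exact tooling may differ)."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def passes_asr_filter(reference_text, asr_transcript, threshold=0.05):
    """Keep a synthetic-speech sample only if round-trip ASR WER < 5%."""
    return word_error_rate(reference_text, asr_transcript) < threshold
```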
6. Empirical Results and Trade-offs
XS-CoT delivers marked improvements in instruction-following quality for non-core languages, as measured by GPT-4 evaluation scores on Japanese test sets (OpenHermes & ALPACA):
| Model | Direct SFT | XS-CoT | Absolute Gain | Relative Gain | Tokens (full) | Tokens (semi) |
|---|---|---|---|---|---|---|
| SALMONN-JA | 28.4 | 50.3 | +21.9 | +77% | 107 | 53 |
For SLLMs averaging over SALMONN and Qwen2Audio, XS-CoT achieves a 45% relative improvement over direct supervised fine-tuning. A comparative trade-off between compression and performance is observed:
| Method | GPT-4 score | CoT tokens | Delay reduction |
|---|---|---|---|
| Full XS-CoT | 50.3 | 107 | – |
| Semi-Implicit XS-CoT | 43.0 | 53 | 50.5% |
Adopting the semi-implicit scheme saves roughly half the reasoning-token latency with only a 14.5% drop in GPT-4 score, representing a competitive balance between speed and answer quality (Xue et al., 29 Apr 2025).
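The percentages above follow directly from the tabulated scores:

```python
# Reproducing the reported trade-off figures from the tables above.
full_score, semi_score, sft_score = 50.3, 43.0, 28.4

relative_drop = (full_score - semi_score) / full_score  # quality cost of compression
relative_gain = (full_score - sft_score) / sft_score    # full XS-CoT vs. direct SFT

print(round(relative_drop, 3), round(relative_gain, 2))  # 0.145 0.77
```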