This paper, "Crosslingual Reasoning through Test-Time Scaling" (Yong et al., 8 May 2025 ), investigates the crosslingual generalization capabilities of English-centric Reasoning LLMs (RLMs) when scaling up compute at test time. RLMs, which excel at generating long chain-of-thoughts (CoTs) for complex reasoning tasks, are primarily trained and evaluated in English. This work explores whether this English-centric training can transfer reasoning abilities to other languages, particularly through the lens of test-time scaling (allocating more inference budget, like longer CoTs).
The authors use s1 models (Muennighoff et al., 31 Jan 2025), which are multilingual Qwen2.5-Instruct models (Qwen et al., 19 Dec 2024) finetuned on a small dataset (1k samples) of English STEM reasoning data. They evaluate s1 models of various sizes (1.5B, 3B, 7B, 14B, and 32B parameters) on several multilingual benchmarks, primarily the Multilingual Grade School Math (MGSM) dataset (Shi et al., 2023), but also Global-MMLU (Singh et al., 4 Dec 2024), FORK (Palta and Rudinger, 2023), and COPAL-ID (Wibowo et al., 2024).
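As a concrete starting point, the sketch below shows one way to load an s1-style checkpoint and the per-language MGSM test sets with HuggingFace libraries. The model and dataset IDs (`simplescaling/s1.1-32B`, `juletxara/mgsm`) and the field names are assumptions for illustration, not details taken from the paper; substitute whatever checkpoints and splits you actually evaluate.

```python
# Hedged sketch of the evaluation setup: an s1-style reasoning model plus the
# per-language MGSM test sets. Model/dataset IDs below are assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "simplescaling/s1.1-32B"  # assumed checkpoint; the paper studies 1.5B-32B variants
MGSM_LANGS = ["en", "de", "fr", "es", "ru", "zh", "ja", "th", "sw", "bn", "te"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")

# MGSM exposes one config per language; each test row pairs a grade-school
# math word problem ("question") with its numeric answer ("answer_number").
mgsm = {lang: load_dataset("juletxara/mgsm", lang, split="test") for lang in MGSM_LANGS}
```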
The research addresses four key questions:
- How effective is test-time scaling for English-centric RLMs on multilingual tasks?
- What language-mixing behaviors do these models exhibit?
- How well do they perform when forced to reason in non-English languages?
- Does crosslingual reasoning generalization extend beyond the STEM domain?
Key Findings and Practical Implications:
- Crosslingual Test-Time Scaling is Effective for Larger Models:
  - Test-time scaling (allowing longer CoTs) significantly improves multilingual mathematical reasoning accuracy on MGSM, outperforming the base models and some prior state-of-the-art models trained on multilingual data.
  - This benefit is most pronounced for models with 3 billion parameters or more. Models smaller than 3B show minimal gains, in contrast to prior work that used smaller models (Son et al., 24 Feb 2025) and reported negative findings.
  - Increasing the maximum thinking tokens from 0.5k to 8k yielded substantial accuracy gains, particularly for the 14B model (+9.4% average increase).
  - A Pareto-frontier analysis shows that larger models (14B, 32B) achieve accuracy ceilings unattainable by smaller models, even with significant compute. The 14B model is identified as a good "sweet spot" in terms of accuracy-to-compute efficiency.
  - Both truncation (a hard token limit) and extrapolation (prompting with "Wait" tokens) show similar performance gains when sufficient budget is allocated; a minimal sketch of both controls appears after this list.
- "Quote-and-Think" is a Dominant Language-Mixing Pattern:
  - After English-centric finetuning, the s1 models predominantly use English as their reasoning language, even when the input is in another language. This indicates a shift in dominant language compared to their multilingual base models (Qwen2.5-Instruct), which tend to respond in the input language.
  - When given non-English inputs, s1 models exhibit a consistent "quote-and-think" pattern: they quote non-English phrases from the input question within their English CoTs and then reason about the quoted material. This suggests the base model's multilingual understanding capabilities enable the English reasoning to process foreign-language inputs.
  - Analysis of intrasentential mixing shows this "quote-and-think" behavior (even without explicit quotation marks) is the primary form of mixing. Other patterns, such as intersentential switching, are less common but can occur, especially for languages like Russian.
- Language Forcing Performance Depends on Language Resourcefulness:
  - It is possible to control the reasoning language of s1 models using techniques like prefixes and system prompts, achieving high language compliance (near 100% in some cases with a "combined" strategy).
  - Forcing reasoning into high-resource languages (HRLs) like German, French, or Spanish maintains performance similar to, or slightly better than, reasoning in English.
  - Forcing reasoning into low-resource languages (LRLs) like Bengali, Swahili, or Telugu generally leads to substantial performance degradation compared to letting the model reason in English. Strategies that allow some English mixing (like "translated_wait") perform better than strict in-language forcing for LRLs.
  - There is a trade-off between language compliance and task accuracy; methods achieving near-perfect compliance often result in lower accuracy, especially for LRLs.
  - Crosslingual language-forcing experiments show that reasoning in HRLs (like English or French) often yields better performance than reasoning in the input language, particularly when the input is in an LRL. This suggests translating inputs into an HRL before feeding them to the RLM could be a practical strategy.
  - Reasoning in LRLs is also less token-efficient, requiring substantially more tokens for the same task and leading to higher inference costs, which is attributed to tokenization disparities across languages.
- Poor Cross-Domain Generalization:
  - While the English STEM reasoning finetuning effectively transfers to multilingual STEM tasks (like MGSM and the STEM subdomains of Global-MMLU), the benefits are minimal or even negative for non-STEM domains.
  - Test-time scaling provides large gains for STEM but shows little improvement, or sometimes decreased performance due to overthinking (Liu et al., 27 Oct 2024), for domains like medicine, humanities, or social sciences.
  - For cultural commonsense tasks (FORK, COPAL-ID), scaling up thinking tokens did not improve performance and could even hurt it.
  - The impact of finetuning and test-time scaling on culturally-specific vs. culturally-agnostic questions within Global-MMLU is inconsistent across domains.
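The truncation and extrapolation controls referenced in the first group of findings can be realized with ordinary `generate` calls. The sketch below is a minimal illustration under the assumption of a HuggingFace causal LM and greedy decoding; it is not the authors' exact decoding pipeline, and the "Wait" continuation is a simplified stand-in for s1-style budget forcing.

```python
# Minimal sketch of the two test-time scaling knobs discussed above:
#   truncation    -> cap the chain-of-thought with max_new_tokens
#   extrapolation -> append "Wait" and let the model keep reasoning
# This is an illustration, not the paper's exact decoding setup.
def generate_with_budget(model, tokenizer, prompt, max_thinking_tokens=8000, num_waits=0):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Truncation: a hard cap on how many thinking tokens the model may emit.
    out = model.generate(**inputs, max_new_tokens=max_thinking_tokens, do_sample=False)

    # Extrapolation: re-feed the generation with "Wait" appended so the model
    # continues its chain-of-thought instead of stopping.
    for _ in range(num_waits):
        text = tokenizer.decode(out[0], skip_special_tokens=True) + "\nWait"
        cont = tokenizer(text, return_tensors="pt").to(model.device)
        out = model.generate(**cont, max_new_tokens=1000, do_sample=False)

    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Since the findings above indicate that truncation and extrapolation perform similarly once the budget is large enough, sweeping `max_new_tokens` alone is a reasonable default before adding "Wait"-based continuation.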
Implementation Considerations for Practitioners:
- Model Selection: For crosslingual reasoning tasks, choose multilingual base models with at least 3 billion parameters to effectively leverage test-time scaling. Larger models (14B+) offer the best performance ceiling.
- Finetuning Strategy: If performing English-centric reasoning finetuning, keep the training data size and epochs small (similar to s1's 1k samples, 5 epochs) to minimize catastrophic forgetting of multilingual capabilities.
- Inference Strategy: Implement test-time scaling by allowing models to generate long CoTs (e.g., up to 8k thinking tokens for the 14B model). You can enforce this budget via truncation (setting `max_new_tokens`) or extrapolation (appending "Wait" tokens).
- Reasoning Language: For optimal performance and token efficiency, encourage the model to reason in English or another high-resource language it is proficient in, even if the input query is in a different language. Consider translating inputs into an HRL if the model struggles with the input language.
- Language Forcing: If strict in-language reasoning is required for HRLs, the "combined" forcing strategy (prefix + system prompt + translated "Wait") is effective for compliance but may slightly impact performance; a prompt-construction sketch follows this list. For LRLs, avoid strict in-language forcing; strategies allowing English reasoning alongside the target language (like `translated_wait`) are preferable but still result in lower accuracy and higher costs compared to HRLs.
- Domain Limitations: Be aware that reasoning capabilities acquired from training data in one domain (e.g., STEM) may not generalize well to others. Test-time scaling is unlikely to improve performance for tasks outside the training domain.
- Low-Resource Languages: Multilingual reasoning in LRLs remains challenging due to poorer performance and higher inference costs caused by tokenization inefficiencies. Further research is needed to improve LRL reasoning capabilities and address tokenization disparities.
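As referenced in the Language Forcing item above, a combined strategy can be assembled from a system prompt, an assistant-side prefix in the target language, and a translated "Wait" word for extrapolation. The sketch below is illustrative only: the prompt wording, the language set, and the translations are stand-ins rather than the paper's exact prompts, and it assumes a Qwen2.5-style chat template.

```python
# Hedged sketch of a "combined" language-forcing setup: system prompt +
# target-language prefix for the assistant turn + a translated "Wait" word
# for extrapolation. All strings are illustrative, not the paper's prompts.
LANG_NAMES = {"de": "German", "fr": "French", "es": "Spanish"}
TRANSLATED_WAIT = {"de": "Warte", "fr": "Attendez", "es": "Espera"}
PREFIX = {
    "de": "Okay, denken wir Schritt für Schritt nach.",   # "let's think step by step"
    "fr": "Bon, réfléchissons étape par étape.",
    "es": "Bien, pensemos paso a paso.",
}

def build_forced_prompt(tokenizer, question, target_lang="de"):
    """Return (prompt, wait_word) that pushes the CoT into target_lang."""
    messages = [
        {"role": "system",
         "content": f"Reason about and answer the question only in {LANG_NAMES[target_lang]}."},
        {"role": "user", "content": question},
    ]
    # Open the assistant turn, then seed it with a short in-language prefix so
    # the chain-of-thought starts in the target language.
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return prompt + PREFIX[target_lang], TRANSLATED_WAIT[target_lang]
```

The returned wait word would replace "Wait" in the extrapolation loop sketched earlier. Per the findings above, for LRLs it is generally better to relax strict compliance (translated-wait-style mixing, or no forcing at all) than to insist on fully in-language reasoning.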
In conclusion, test-time scaling is a powerful technique to unlock the crosslingual reasoning potential of English-centric RLMs, particularly for models 3B and larger, but its effectiveness is largely confined to domains similar to the finetuning data and is significantly hindered for low-resource languages and out-of-domain tasks. Practitioners should strategically choose model size, finetuning approach, and reasoning language based on these findings.