
When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy (2505.22888v1)

Published 28 May 2025 in cs.CL

Abstract: Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.

Summary

  • The paper reveals a trade-off between forcing thinking traces into the user-specified language and maintaining high answer accuracy.
  • It introduces the XReasoning benchmark to evaluate multilingual reasoning across math and science tasks in 11 languages.
  • Techniques like prompt hacking significantly boost language matching rates but lead to a notable drop in answer accuracy.

This paper (2505.22888) addresses a crucial, often overlooked aspect of multilingual Large Reasoning Models (LRMs): their ability to generate step-by-step thinking traces in the user's specified language. While LRMs have shown impressive reasoning capabilities, primarily evaluated in English, the interpretability and trustworthiness of these models in non-English contexts depend significantly on whether their internal thought process, expressed as a thinking trace, is understandable to the user.

The authors highlight a significant gap: current state-of-the-art LRMs frequently fail to reason in the user's native language when prompted, often defaulting to English or producing fragmented multi-language output. This language mismatch in thinking traces undermines user trust and oversight, even if the final answer is correct.

The paper makes several key contributions:

  1. It identifies and reveals a critical trade-off between achieving high language matching rates in thinking traces and maintaining answer accuracy for multilingual users.
  2. It introduces and open-sources a new benchmark called XReasoning, designed with more challenging math and science questions across 11 languages to better evaluate the multilingual reasoning capabilities of advanced LRMs. The benchmark is built upon existing datasets like AIME, GPQA, and MGSM, with questions translated into other languages by GPT-4o-mini.
  3. It demonstrates that while simple prompt-hacking techniques can substantially improve language matching in thinking traces, they come at a significant cost to answer accuracy.
  4. It explores targeted post-training with a small number of instances as a mitigation strategy, showing that it improves language matching but does not fully resolve the accuracy trade-off.

For practical implementation, the paper evaluates six state-of-the-art open-source LRMs from the DeepSeek-Distilled-R1 and Skywork-OR1 families on the XReasoning benchmark. The evaluation uses two primary metrics: language matching rate (percentage of instances where the thinking trace is correctly generated in the specified language, identified using the LangDetect toolkit) and answer accuracy (exact match for the final boxed answer).
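
A minimal sketch of how these two metrics could be computed is shown below. The helper names, the \boxed{...} answer convention, and the use of the langdetect Python package are assumptions for illustration, not the authors' released evaluation code.

```python
# Sketch of the two evaluation metrics described above.
# Assumptions: the final answer is wrapped in \boxed{...} and language
# identification uses the langdetect package.
import re
from langdetect import detect


def extract_boxed_answer(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in a model output."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None


def language_match(thinking_trace: str, target_lang: str) -> bool:
    """True if the detected language of the trace equals the requested one."""
    try:
        return detect(thinking_trace) == target_lang  # e.g. 'ja', 'th', 'fr'
    except Exception:
        return False  # empty or undetectable traces count as mismatches


def evaluate(records: list[dict], target_lang: str) -> tuple[float, float]:
    """records: [{'trace': ..., 'output': ..., 'gold': ...}, ...]"""
    n = len(records)
    matches = sum(language_match(r["trace"], target_lang) for r in records)
    correct = sum(extract_boxed_answer(r["output"]) == r["gold"] for r in records)
    return matches / n, correct / n  # (language matching rate, answer accuracy)
```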

Standard Prompting: Under standard prompting, even large models like Distilled-R1-32B show low language matching rates (e.g., 46.3% on AIME, 42.3% on GPQA) for non-English thinking traces. This means the model's step-by-step reasoning is often not in the language the user requested.

Prompt Hacking: To investigate how well the thinking language can be controlled, the authors employ a prompt-hacking technique: a prefix such as "By request, I will start thinking in {USER_LANG}.", translated into the user's language, is inserted immediately after the opening <think> token. This method is highly effective at forcing the model to generate traces in the specified language. As shown in Table 1 and Figure 1, prompt hacking boosts language matching rates dramatically (e.g., for Distilled-R1-32B on AIME, matching increases from 46.3% to 97.9%). However, this comes with a substantial decrease in answer accuracy (e.g., on AIME, accuracy drops from 25.5% to 17.0%). This highlights the core trade-off: forcing thinking in a specific language, especially one where the model's reasoning capability is weaker, degrades performance on the task itself. The analysis of actual thinking languages (Tables 3 and 4) reveals that models often revert to English or Chinese when instructed to think in languages such as French, Japanese, Thai, or Swahili, suggesting a preference shaped by the training data distribution.
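
The intervention itself is mechanically simple. The sketch below shows one way to apply it with Hugging Face transformers; the model name, the Japanese translation of the prefix, and the chat-template handling are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the prompt-hacking intervention: seed the thinking trace with a
# translated prefix placed right after the opening <think> token, then let the
# model continue generating from there. Model name is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = "..."  # a translated AIME/GPQA/MGSM question in the user's language
# Assumed Japanese rendering of "By request, I will start thinking in Japanese."
forced_prefix = "ご要望に応じて、日本語で考え始めます。"

# Build the chat prompt, then append the <think> token plus the forced prefix.
# Note: some chat templates already append <think>; inspect the rendered prompt
# and avoid duplicating the tag if so.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
) + "<think>\n" + forced_prefix

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(forced_prefix + completion)  # trace now starts in the requested language
```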

Post-Training Mitigation: Recognizing the language preference issue, the paper investigates whether fine-tuning on a small dataset can improve multilingual thinking. They post-train the Distilled-R1-7B model on translated math problems with step-by-step solutions (from the LIMO dataset) using 100 or 250 instances per target language (Japanese, Thai, Telugu). As illustrated in Figure 2 and Tables 5-7, this targeted post-training significantly improves the language matching rate for these languages (approaching 100% for Thai and Telugu with 100 instances). However, similar to prompt hacking, this gain in matching rate is accompanied by a decrease in answer accuracy compared to the original model's accuracy on these tasks. Increasing the training data from 100 to 250 instances does not reliably recover the lost accuracy and can even decrease matching rates in some cases.
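
A condensed sketch of what such targeted post-training might look like is given below: a plain supervised fine-tuning loop over roughly 100 translated examples in one target language. The model name, data format, and hyperparameters are illustrative assumptions, not the paper's exact recipe (which fine-tunes on translated LIMO problems with step-by-step solutions).

```python
# Sketch of targeted post-training on ~100 (question, trace, answer) examples
# translated into a single target language. Hyperparameters are illustrative,
# and loss is computed over the full sequence (no prompt masking) for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")


def build_example(question: str, trace: str, answer: str) -> str:
    # Target format: user question, then a thinking trace and boxed answer,
    # all expressed in the target language.
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    return prompt + f"<think>\n{trace}\n</think>\n\\boxed{{{answer}}}" + tokenizer.eos_token


# Placeholder data: in practice ~100 translated LIMO-style items per language.
examples = [
    ("...translated math question...", "...step-by-step trace in the target language...", "42"),
]
texts = [build_example(*ex) for ex in examples]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(3):
    for text in texts:  # batch size 1 for simplicity
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```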

The paper concludes that while methods like prompt hacking and few-shot post-training can increase the likelihood of LRMs generating thinking traces in a specified non-English language, they currently do so at the expense of answer accuracy, particularly on more complex reasoning tasks. This persistent trade-off implies that current methods for controlling thinking language may interfere with the model's core reasoning process or rely on surface-level language generation rather than truly multilingual reasoning.

For practitioners deploying multilingual LRMs, this research highlights a critical challenge: achieving both high task performance (accuracy) and user-friendly interpretability (thinking in the user's language) simultaneously is difficult with current models and simple adaptation techniques. The findings suggest that more advanced methods, potentially involving reinforcement learning or deeper architectural changes to better align reasoning capabilities across languages, are necessary to overcome this trade-off and build truly trustworthy multilingual reasoning systems for real-world applications.

The paper notes several limitations for future work, including the benchmark's focus on specific task types, potential noise from machine translation of the benchmark and training data, limitations of the automatic language detection tool, and the need for user-centered evaluation to understand the actual impact of thinking trace language on trust and usability.
