What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

Published 6 Apr 2026 in cs.CL and cs.AI | (2604.04720v1)

Abstract: Large Reasoning Models (LRMs) still exhibit large performance gaps between English and other languages, yet much current work assumes these gaps can be closed simply by making reasoning in every language resemble English reasoning. This work challenges this assumption by asking instead: what actually characterizes effective reasoning in multilingual settings, and to what extent do English-derived reasoning features genuinely help in other languages? We first define a suite of measurable reasoning features spanning multilingual alignment, reasoning step, and reasoning flow aspects of reasoning traces, and use logistic regression to quantify how each feature associates with final answer accuracy. We further train sparse autoencoders over multilingual traces to automatically discover latent reasoning concepts that instantiate or extend these features. Finally, we use the features as test-time selection policies to examine whether they can steer models toward stronger multilingual reasoning. Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in some. Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.


Authors (3)

Summary

The paper introduces 16 measurable reasoning features spanning multilingual alignment, step quality, and flow dynamics to quantify trace effectiveness.
Regression analysis and sparse autoencoder models reveal varying feature impacts across languages, highlighting both adaptive selection benefits and cross-lingual conflicts.
Test-time trace selection based on features like direct utility and self-checking yields significant accuracy gains, especially on challenging benchmarks.

Disentangling Multilingual Reasoning: Feature-Based Analysis and Adaptive Trace Evaluation

Introduction

This paper, "What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features" (2604.04720), systematically interrogates the assumptions underlying multilingual reasoning performance in Large Reasoning Models (LRMs). The authors challenge the prevalent paradigm of directly projecting reasoning practices from English to other languages, particularly the notion that English-derived reasoning features universally underpin effective reasoning across linguistic boundaries. Instead, they introduce a comprehensive framework that identifies, quantifies, and exploits measurable features within reasoning traces to dissect and optimize multilingual reasoning, moving beyond final answer correctness toward a nuanced, trace-level analysis.

Methodology

Feature Definition and Categorization

The core of the study is the definition of 16 measurable reasoning features that span three axes: multilingual alignment, reasoning step quality, and reasoning flow dynamics. These features encompass structural and semantic alignment to English traces, translation quality (COMET-QE), step count, logical validity (NLI-based), direct and indirect utility, $\mathcal{V}$-Information, and cognitively relevant tags (e.g., self-checking, plan generation, uncertainty management), all annotated and quantified for each language and trace.

Figure 1: Overview of the method, showing the complete suite of reasoning features and their association analyses via regression, SAEs, and test-time selection policies.

Statistical Analysis and Regression Modeling

For each feature and each language, a univariate logistic regression models the association between the normalized feature value and final-answer correctness. The discrete change in predicted accuracy across $\pm1$ standard deviation of the feature ($\Delta \mathrm{Acc}$) serves as the effect size. Because every feature is normalized first, $\Delta \mathrm{Acc}$ is directly comparable across features and languages.
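The effect-size computation described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes that $\Delta \mathrm{Acc}$ is the predicted accuracy at $+1$ standard deviation minus the predicted accuracy at $-1$ standard deviation, it fits the regression with a small damped Newton solver, and the feature values and correctness labels are made-up toy data.

```python
import math
from statistics import mean, pstdev


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def fit_univariate_logit(x, y, iters=50, damping=1e-6):
    """Fit p(correct) = sigmoid(b0 + b1 * x) with Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. intercept
            g1 += (yi - p) * xi       # gradient w.r.t. slope
            h00 += w                  # entries of X^T W X
            h01 += w * xi
            h11 += w * xi * xi
        h00 += damping                # small damping keeps the 2x2 solve stable
        h11 += damping
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def delta_acc(feature_values, correct):
    """Predicted accuracy at +1 SD minus predicted accuracy at -1 SD."""
    mu, sd = mean(feature_values), pstdev(feature_values)
    z = [(v - mu) / sd for v in feature_values]   # z-score the feature
    b0, b1 = fit_univariate_logit(z, correct)
    return sigmoid(b0 + b1) - sigmoid(b0 - b1)


# Toy data (hypothetical): one feature value and one 0/1 label per trace.
toy_feature = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_correct = [0, 0, 1, 0, 1, 0, 1, 1]
toy_delta = delta_acc(toy_feature, toy_correct)
print(f"Delta Acc on toy data: {toy_delta:+.3f}")
```

A positive value means traces with higher feature values are predicted to be correct more often; running the same function separately per language is what would expose the sign reversals discussed in the results.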

Sparse Autoencoder Discovery

To extend beyond the hand-designed feature set, sparse autoencoders (SAEs) are trained on multilingual reasoning traces. Latent units whose activations correlate strongly with answer accuracy are identified, interpreted via GPT-4o, and compared against the established features. This approach offers a way to capture fine-grained and emergent reasoning behaviors beyond the scope of manual annotation.
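The step of picking out accuracy-correlated latents can be illustrated as follows. This sketch assumes an SAE has already been trained and that each trace has been reduced to one activation value per latent; the activation matrix, the use of Pearson correlation, and the `rank_latents` helper are all hypothetical choices made for illustration, since the summary does not specify the exact correlation measure.

```python
import math


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def rank_latents(activations, correct, top_k=2):
    """Rank SAE latents by |correlation| between activation and correctness.

    activations: one row per trace, one column per latent (toy values here).
    correct:     one 0/1 label per trace.
    """
    n_latents = len(activations[0])
    scores = []
    for j in range(n_latents):
        column = [row[j] for row in activations]
        scores.append((j, pearson(column, correct)))
    scores.sort(key=lambda t: abs(t[1]), reverse=True)
    return scores[:top_k]


# Toy data (hypothetical): 4 traces, 3 latents.
toy_activations = [
    [0.9, 0.0, 0.2],
    [0.8, 0.1, 0.2],
    [0.1, 0.7, 0.2],
    [0.0, 0.9, 0.2],
]
toy_labels = [1, 1, 0, 0]
top_latents = rank_latents(toy_activations, toy_labels)
for index, r in top_latents:
    print(f"latent {index}: r = {r:+.3f}")
```

Ranking by absolute correlation keeps latents that predict failure as well as success, which matters here because some discovered concepts (such as language mixing) are reported as penalties.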

Test-Time Trace Selection

Each feature is operationalized as an inference-time trace selection policy. For each query, 32 candidate reasoning traces are generated and ranked by feature value, enabling empirical evaluation of each policy's impact on pass@1 accuracy relative to random and English-similarity-based selection strategies.
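A selection policy of this kind can be sketched as below. The sketch assumes that the top-ranked trace (highest feature value) is the one selected and that the random baseline is the expected accuracy of picking a candidate uniformly at random; the dictionary keys, the feature name `direct_utility`, and the toy candidates are hypothetical, and only 3 candidates per query are shown instead of 32.

```python
def select_by_feature(candidates, feature):
    """Return the candidate trace with the highest value of `feature`."""
    return max(candidates, key=lambda c: c[feature])


def policy_accuracy(queries, feature):
    """Fraction of queries whose selected trace has a correct final answer."""
    hits = sum(select_by_feature(q, feature)["correct"] for q in queries)
    return hits / len(queries)


def random_baseline(queries):
    """Expected accuracy when one candidate is picked uniformly at random."""
    per_query = [sum(c["correct"] for c in q) / len(q) for q in queries]
    return sum(per_query) / len(per_query)


# Toy data (hypothetical): 2 queries, 3 candidate traces each.
toy_queries = [
    [
        {"direct_utility": 0.9, "correct": 1},
        {"direct_utility": 0.4, "correct": 0},
        {"direct_utility": 0.2, "correct": 0},
    ],
    [
        {"direct_utility": 0.3, "correct": 0},
        {"direct_utility": 0.8, "correct": 1},
        {"direct_utility": 0.6, "correct": 1},
    ],
]
selected_acc = policy_accuracy(toy_queries, "direct_utility")
baseline_acc = random_baseline(toy_queries)
print(f"feature policy: {selected_acc:.2f}, random: {baseline_acc:.2f}")
```

For a feature with a negative association, the same scaffold could select the minimum instead of the maximum; whether the paper does so is not stated in this summary.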

Experimental Design

The evaluation uses two benchmarks, MGSM-Rev2 (middle-school math, human-translated) and AIME (high-school math, machine-translated), across ten languages (Bengali, German, Spanish, French, Russian, Swahili, Telugu, Thai, Chinese, English) and four open-weight LRMs (Distill-Qwen 1.5b/7b, Qwen-3 4b/8b), covering diverse resource levels, typologies, and scripts.

Results

Regression Analysis of Feature Effects

Distinct patterns emerge from feature analysis:

Multilingual Alignment Features: COMET-QE, structural similarity, and semantic similarity to English are positively associated with accuracy overall, but effect sizes vary substantially across languages and reverse sign in some.
Reasoning Step Features: Validity, direct/indirect utility, and $\mathcal{V}$-Information generally exhibit positive associations, though magnitude depends on task difficulty and language context.
Reasoning Flow Features: Cognitive tags related to computation, result consolidation, and self-checking consistently predict accuracy, but features like uncertainty management display negative or inconsistent associations, often diverging between English and non-English traces.

Figure 2: Comparative feature analysis results for English and non-English traces, highlighting cross-lingual variability in feature effect sizes.

Per-language decomposition reveals stronger conflicts: e.g., self-checking has a positive $\Delta \mathrm{Acc}$ in English but a negative one in Swahili and Telugu, and final answer emission steps are advantageous in English but detrimental for some languages on AIME.

Figure 3: Per-language feature analysis results, exposing language-specific divergences in feature-accuracy relationships.

SAE Concept Discovery

SAEs not only corroborate feature-defined behaviors—e.g., preferred deduction structures or penalty for language-mixing—but also uncover latent concepts, such as ordinal breakdowns in Chinese traces or translation-error discussion behaviors in Bengali, that strongly associate with accuracy differentials.

Trace Selection and Policy Evaluation

The utility of features as selection policies is benchmarked:

Semantic similarity to English is competitive but not universally optimal; direct/indirect utility and result consolidation outperform semantic similarity on challenging benchmarks (AIME), yielding up to +10% gains in pass@1 for some LRMs.
Uncertainty management consistently and significantly lowers accuracy.
On easier benchmarks (MGSM-Rev2), feature-based selection offers negligible improvement over random selection.

Figure 4: Test-time selection performance across features and languages, demonstrating the superiority of certain step/flow features over English-similarity, especially on complex reasoning tasks.

Implications and Discussion

The analysis directly challenges the universality of English-centric reward and trace alignment objectives. The findings advocate for adaptive, language-aware benchmark and reward designs, emphasizing:

The crucial role of per-language translation quality (COMET-QE) in benchmark construction.
The transferability of reasoning step quality metrics (especially utility) across languages, supporting their adoption in language-agnostic evaluation and reward models.
The inadequacy of semantic similarity to English as a sole reward objective; step and flow features often yield stronger accuracy improvements.
The necessity for flexible reward and selection strategies that accommodate cross-lingual conflicts in feature effect sizes, leveraging adaptive reasoning and trace selection mechanisms.

Limitations

The study is limited to mathematical benchmarks and four LRMs, and it relies on GPT-4o for step annotation and dependency labeling. While human verification shows substantial agreement, future work should explore broader domains, additional architectures, and more extensive validation of both human and machine annotations.

Conclusion

This paper delivers a systematic, feature-driven investigation into the dynamics of multilingual reasoning in LRMs, exposing cross-lingual variability and conflict in the effectiveness of reasoning features. English-centric reasoning paradigms are empirically challenged, and alternative reward signals—grounded in step utility and reasoning flow—are shown to offer stronger, language-adaptive performance improvements. The implications span benchmark design, evaluation methodology, and RL reward modeling, opening paths for more nuanced, robust, and generalizable multilingual reasoning capabilities in AI systems.
