Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning (2502.17407v1)

Published 24 Feb 2025 in cs.CL

Abstract: Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a 1.94-point average gain across other languages, a pattern consistent across the other test-time scaling methods we studied, highlighting that test-time scaling may not generalize as effectively to multilingual tasks. To foster further research, we release MCLM, MR1-1.5B, and evaluation results.
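For orientation, the sketch below shows the general shape of the two strategy families the abstract contrasts: ORM-style best-of-N selection and Budget Forcing. It is a minimal illustration, not the paper's implementation; `generate`, `generate_step`, and `reward` are hypothetical stand-ins for a policy model and an outcome reward model.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """ORM-style test-time scaling: sample n complete solutions and return
    the one the outcome reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda solution: reward(prompt, solution))

def budget_forcing(prompt: str,
                   generate_step: Callable[[str], str],
                   max_thinking_tokens: int = 4096,
                   continue_cue: str = "Wait") -> str:
    """BF-style test-time scaling: let the model keep reasoning until a token
    budget is spent, nudging it to continue if it tries to stop early."""
    trace = prompt
    spent = 0
    while spent < max_thinking_tokens:
        step = generate_step(trace)
        if not step:
            break
        spent += len(step.split())  # rough token count, for illustration only
        trace += step
        if step.rstrip().endswith("</think>") and spent < max_thinking_tokens:
            # Suppress the end-of-thinking marker and cue further reasoning.
            trace = trace.rstrip().removesuffix("</think>") + continue_cue
    return trace
```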

Summary

  • The paper evaluates test-time scaling methods for multilingual mathematical reasoning on a 55-language benchmark.
  • While scaling yields significant gains on English benchmarks (up to 20 points), average gains across other languages are minimal (+1.94 points).
  • Experimental analysis shows that the scaling techniques increase variance and reduce cross-lingual consistency, even where methods such as ORM improve accuracy.

This paper evaluates the effectiveness of test-time scaling methods for multilingual mathematical reasoning using a novel competition-level benchmark spanning 55 languages.

  • The authors introduce MCLM and MR1-1.5B, enabling evaluation and training of multilingual LLMs on high-complexity math problems, while emphasizing the limitations of translated word problems.
  • They rigorously compare Outcome Reward Modeling, Process Reward Modeling, and Budget Forcing under equivalent inference FLOPs, finding that although English benchmarks (e.g., AIME) gain up to 20 points, gains average only +1.94 points across other languages.
  • Experimental analysis reveals that while ORM generally outperforms PRM in accuracy, all scaling techniques incur increased variance and reduced cross-lingual consistency when constrained to similar computational budgets (one simple way to quantify such consistency is sketched below).
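
One simple, illustrative way to quantify cross-lingual consistency is the fraction of problems a model answers either correctly in every language or incorrectly in every language. This is an assumption for exposition, not necessarily the exact statistic used in the paper.

```python
from typing import Dict, List

def cross_lingual_consistency(results: Dict[str, List[bool]]) -> float:
    """`results` maps a language code to per-problem correctness flags,
    aligned so that index i refers to the same underlying problem."""
    per_problem = list(zip(*results.values()))  # one tuple of flags per problem
    consistent = sum(1 for flags in per_problem if len(set(flags)) == 1)
    return consistent / len(per_problem)

# Hypothetical example: three aligned problems scored in three languages.
scores = {
    "en": [True, True, False],
    "de": [True, False, False],
    "sw": [True, False, False],
}
print(f"consistency = {cross_lingual_consistency(scores):.2f}")  # 0.67
```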