- The paper reveals that RL-tuned models achieve superior transferability across tasks compared to SFT-tuned models.
- It introduces a novel Transferability Index and employs latent-space and token-level analyses to isolate the effects of fine-tuning paradigms.
- Findings indicate that RL preserves general-domain representations while SFT leads to catastrophic forgetting on non-reasoning tasks.
Transferability of Mathematical Reasoning in LLMs: A Systematic Analysis of Fine-Tuning Paradigms
The paper "Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning" (2507.00432) presents a comprehensive empirical paper on the transferability of mathematical reasoning capabilities in LLMs to broader reasoning and non-reasoning tasks. The authors systematically evaluate over 20 open-weight, reasoning-tuned models and conduct controlled experiments to disentangle the effects of fine-tuning paradigms—specifically, supervised fine-tuning (SFT) versus reinforcement learning (RL)—on cross-domain generalization.
Motivation and Problem Statement
Recent advances in LLMs have led to rapid progress on math-centric benchmarks, with models surpassing human-level performance on datasets such as MATH and AIME. However, the extent to which these improvements in mathematical reasoning transfer to other domains—such as scientific QA, coding, agent planning, and general instruction following—remains unclear. The central question addressed is whether gains in mathematical reasoning reflect broader problem-solving ability or are merely the result of narrow overfitting.
Experimental Design and Methodology
The paper evaluates models across three task groups:
- Math Reasoning: MATH500, AIME24/25, OlympiadBench
- Other Reasoning: LiveCodeBench, GPQA-Diamond, ACPBench, HeadQA
- Non-Reasoning: CoQA, IFEval, HaluEval, MC-TACO
A novel metric, the Transferability Index (TI), is introduced to quantify the relative performance gain in non-math domains normalized by the gain in math reasoning. Positive TI indicates successful transfer, while negative TI indicates degradation.
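The paper's exact formula is not reproduced here, but a minimal sketch of a TI-style computation, assuming TI is the relative gain on a target domain normalized by the relative gain on math (the function names and the final x100 scaling are illustrative assumptions):

```python
def relative_gain(tuned: float, base: float) -> float:
    """Relative performance gain (%) of a tuned model over its base."""
    return 100.0 * (tuned - base) / base

def transferability_index(math_tuned: float, math_base: float,
                          target_tuned: float, target_base: float) -> float:
    """TI-style metric: relative gain on a non-math target domain,
    normalized by the relative gain on math reasoning.
    Positive => gains transfer; negative => degradation."""
    gain_math = relative_gain(math_tuned, math_base)
    gain_target = relative_gain(target_tuned, target_base)
    return 100.0 * gain_target / gain_math  # x100 scaling is an assumption

# Example: math accuracy 40 -> 60 with a non-reasoning drop 70 -> 63
# yields TI = 100 * (-10 / 50) = -20.0, i.e. negative transfer.
print(transferability_index(60, 40, 63, 70))  # -20.0
```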
To isolate the effect of fine-tuning paradigms, the authors conduct controlled experiments on Qwen3-14B, fine-tuning on identical math-only data using either SFT (with teacher-forced chain-of-thought traces) or RL (using answer correctness as reward). This design ensures that observed differences are attributable to the optimization method rather than data or architecture.
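To make the contrast concrete, here is a minimal sketch of the kind of outcome-based reward the RL arm optimizes, assuming a LaTeX \boxed{...} answer convention and exact-match grading (both are assumptions; the paper's verifier may differ):

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer, assuming a LaTeX \\boxed{...} convention."""
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    return match.group(1).strip() if match else None

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 on exact match with the reference answer.
    The RL arm optimizes only this signal; the SFT arm instead imitates
    teacher-forced chain-of-thought traces token by token."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted == gold_answer.strip() else 0.0

print(correctness_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
```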
Key Findings
1. Fine-Tuning Paradigm is the Primary Driver of Transferability
- RL-tuned models consistently achieve higher TI on both other reasoning and non-reasoning tasks, regardless of model size or architecture.
- SFT-tuned models often exhibit negative TI on non-reasoning tasks, indicating catastrophic forgetting and over-specialization to the math domain.
2. Latent Representation and Output Distribution Stability
- PCA analysis of hidden states reveals that RL induces minimal drift in latent representations across all task types, preserving general-domain structure (a drift-measurement sketch follows this list).
- SFT induces substantial latent and output drift, especially for non-reasoning inputs, leading to representation collapse and degraded generalization.
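A minimal sketch of one way to quantify such drift, assuming mean-pooled hidden states for the same prompt set from the base and tuned models (the pooling, the shared PCA basis, and the displacement statistic are assumptions, not the paper's exact protocol):

```python
import numpy as np
from sklearn.decomposition import PCA

def latent_drift(base_hidden: np.ndarray, tuned_hidden: np.ndarray,
                 n_components: int = 2) -> float:
    """Fit PCA on the base model's hidden states (n_prompts x d), project
    both models' states into that shared basis, and report the mean
    per-prompt displacement. Larger values = more representational drift."""
    pca = PCA(n_components=n_components).fit(base_hidden)
    displacement = pca.transform(tuned_hidden) - pca.transform(base_hidden)
    return float(np.linalg.norm(displacement, axis=1).mean())

# Under the paper's findings one would expect, on non-reasoning prompts:
# latent_drift(base_h, rl_h) << latent_drift(base_h, sft_h)
```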
3. Token-Level Distributional Effects
- KL-divergence and token rank shift analyses show that RL-tuned models maintain output distributions close to the base model, selectively shifting only task-relevant tokens (see the sketch after this list).
- SFT-tuned models exhibit widespread, indiscriminate token shifts, including the introduction of reasoning tokens into non-reasoning tasks, resulting in unnecessary "overthinking" and performance loss.
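A minimal sketch of both diagnostics, assuming per-position logits from the base and tuned models on identical inputs (the paper's exact aggregation may differ):

```python
import torch
import torch.nn.functional as F

def token_kl(base_logits: torch.Tensor, tuned_logits: torch.Tensor) -> torch.Tensor:
    """Per-position KL(tuned || base) over the vocabulary.
    Logits have shape (seq_len, vocab_size); small values mean the tuned
    model stays close to the base distribution at that position."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    tuned_logp = F.log_softmax(tuned_logits, dim=-1)
    return (tuned_logp.exp() * (tuned_logp - base_logp)).sum(dim=-1)

def rank_shift(base_logits: torch.Tensor, tuned_logits: torch.Tensor,
               token_ids: torch.Tensor) -> torch.Tensor:
    """Change in each realized token's rank after tuning
    (positive = the token was demoted by fine-tuning)."""
    def ranks(logits: torch.Tensor) -> torch.Tensor:
        # argsort-of-argsort yields each token's rank (0 = most probable)
        order = logits.argsort(dim=-1, descending=True).argsort(dim=-1)
        return order.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return ranks(tuned_logits) - ranks(base_logits)
```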
4. Numerical Results
- In controlled studies, RL-tuned Qwen3-14B achieves positive TI on both other reasoning (+79.6) and non-reasoning (+29.3) tasks, while SFT-tuned variants show negative TI on non-reasoning tasks (e.g., -41.2, -250.2).
- RL-tuned models outperform SFT-tuned models by substantial margins on non-math benchmarks, even when both are trained on the same math-only data.
Implications
Practical
- RL-based post-training is essential for developing LLMs that retain general-domain capabilities while improving on specialized reasoning tasks. This has direct implications for the design of LLM training pipelines in both academic and industrial settings.
- SFT on narrow, static datasets can lead to catastrophic forgetting, undermining the utility of LLMs in real-world, multi-domain applications.
- Token-level and latent-space diagnostics should be standard practice for evaluating the impact of fine-tuning on model generalization.
Theoretical
- The findings challenge the assumption that improvements in mathematical reasoning automatically translate to broader cognitive abilities in LLMs.
- The results support the hypothesis that on-policy RL updates reinforce desired skills without disrupting general-domain representations, while off-policy SFT can induce representation collapse.
- The paper provides empirical evidence for the importance of optimization dynamics—beyond data and architecture—in shaping the functional capacity of LLMs.
Future Directions
- Scaling RL-based fine-tuning to larger models and more diverse reasoning domains, including multimodal and embodied tasks.
- Developing hybrid or curriculum-based fine-tuning strategies that combine the stability of RL with the efficiency of SFT.
- Investigating the interplay between pre-training data diversity, model size, and fine-tuning paradigm in determining cross-domain generalization.
- Extending latent-space and token-level analyses to other forms of post-training, such as direct preference optimization and process supervision.
Conclusion
This work provides a rigorous, multi-faceted analysis of the transferability of mathematical reasoning in LLMs, demonstrating that RL-based fine-tuning is critical for preserving and enhancing general-domain capabilities. The results have immediate implications for the development and deployment of LLMs in settings where both specialized reasoning and broad competence are required. The diagnostic framework established here sets a new standard for evaluating the impact of fine-tuning paradigms on LLM generalization.