- The paper reveals that RL-tuned models achieve superior transferability across tasks compared to SFT-tuned models.
- It introduces a novel Transferability Index and employs latent-space and token-level analyses to isolate the effects of fine-tuning paradigms.
- Findings indicate that RL preserves general-domain representations while SFT leads to catastrophic forgetting on non-reasoning tasks.
Transferability of Mathematical Reasoning in LLMs: A Systematic Analysis of Fine-Tuning Paradigms
The paper "Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning" (2507.00432) presents a comprehensive empirical paper on the transferability of mathematical reasoning capabilities in LLMs to broader reasoning and non-reasoning tasks. The authors systematically evaluate over 20 open-weight, reasoning-tuned models and conduct controlled experiments to disentangle the effects of fine-tuning paradigms—specifically, supervised fine-tuning (SFT) versus reinforcement learning (RL)—on cross-domain generalization.
Motivation and Problem Statement
Recent advances in LLMs have led to rapid progress on math-centric benchmarks, with models surpassing human-level performance on datasets such as MATH and AIME. However, the extent to which these improvements in mathematical reasoning transfer to other domains—such as scientific QA, coding, agent planning, and general instruction following—remains unclear. The central question addressed is whether gains in mathematical reasoning reflect broader problem-solving ability or are merely the result of narrow overfitting.
Experimental Design and Methodology
The paper evaluates models across three task groups:
- Math Reasoning: MATH500, AIME24/25, OlympiadBench
- Other Reasoning: LiveCodeBench, GPQA-Diamond, ACPBench, HeadQA
- Non-Reasoning: CoQA, IFEval, HaluEval, MC-TACO
A novel metric, the Transferability Index (TI), is introduced to quantify the relative performance gain in non-math domains normalized by the gain in math reasoning. Positive TI indicates successful transfer, while negative TI indicates degradation.
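The paper's exact formula is not reproduced here, but a minimal sketch of a TI-style computation, assuming TI is the relative gain on a target domain normalized by the relative gain on math (the function names and the final x100 scaling are illustrative assumptions):

```python
def relative_gain(tuned: float, base: float) -> float:
    """Relative performance gain (%) of a tuned model over its base."""
    return 100.0 * (tuned - base) / base

def transferability_index(math_tuned: float, math_base: float,
                          target_tuned: float, target_base: float) -> float:
    """TI-style metric: relative gain on a non-math target domain,
    normalized by the relative gain on math reasoning.
    Positive => gains transfer; negative => degradation."""
    gain_math = relative_gain(math_tuned, math_base)
    gain_target = relative_gain(target_tuned, target_base)
    return 100.0 * gain_target / gain_math  # x100 scaling is an assumption

# Example: math accuracy 40 -> 60 with a non-reasoning drop 70 -> 63
# yields TI = 100 * (-10 / 50) = -20.0, i.e. negative transfer.
print(transferability_index(60, 40, 63, 70))  # -20.0
```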
To isolate the effect of fine-tuning paradigms, the authors conduct controlled experiments on Qwen3-14B, fine-tuning on identical math-only data using either SFT (with teacher-forced chain-of-thought traces) or RL (using answer correctness as reward). This design ensures that observed differences are attributable to the optimization method rather than data or architecture.
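To make the contrast concrete, here is a minimal sketch of the kind of outcome-based reward the RL arm optimizes, assuming a LaTeX \boxed{...} answer convention and exact-match grading (both are assumptions; the paper's verifier may differ):

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the final answer, assuming a LaTeX \\boxed{...} convention."""
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    return match.group(1).strip() if match else None

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 on exact match with the reference answer.
    The RL arm optimizes only this signal; the SFT arm instead imitates
    teacher-forced chain-of-thought traces token by token."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted == gold_answer.strip() else 0.0

print(correctness_reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
```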
Key Findings
1. Fine-Tuning Paradigm is the Primary Driver of Transferability
- RL-tuned models consistently achieve higher TI on both other reasoning and non-reasoning tasks, regardless of model size or architecture.
- SFT-tuned models often exhibit negative TI on non-reasoning tasks, indicating catastrophic forgetting and over-specialization to the math domain.
2. Latent Representation and Output Distribution Stability
- PCA analysis of hidden states reveals that RL induces minimal drift in latent representations across all task types, preserving general-domain structure (a drift-measurement sketch follows this list).
- SFT induces substantial latent and output drift, especially for non-reasoning inputs, leading to representation collapse and degraded generalization.
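A minimal sketch of one way to quantify such drift, assuming mean-pooled hidden states for the same prompt set from the base and tuned models (the pooling, the shared PCA basis, and the displacement statistic are assumptions, not the paper's exact protocol):

```python
import numpy as np
from sklearn.decomposition import PCA

def latent_drift(base_hidden: np.ndarray, tuned_hidden: np.ndarray,
                 n_components: int = 2) -> float:
    """Fit PCA on the base model's hidden states (n_prompts x d), project
    both models' states into that shared basis, and report the mean
    per-prompt displacement. Larger values = more representational drift."""
    pca = PCA(n_components=n_components).fit(base_hidden)
    displacement = pca.transform(tuned_hidden) - pca.transform(base_hidden)
    return float(np.linalg.norm(displacement, axis=1).mean())

# Under the paper's findings one would expect, on non-reasoning prompts:
# latent_drift(base_h, rl_h) << latent_drift(base_h, sft_h)
```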
3. Token-Level Distributional Effects
- KL-divergence and token rank shift analyses show that RL-tuned models maintain output distributions close to the base model, selectively shifting only task-relevant tokens (see the sketch after this list).
- SFT-tuned models exhibit widespread, indiscriminate token shifts, including the introduction of reasoning tokens into non-reasoning tasks, resulting in unnecessary "overthinking" and performance loss.
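A minimal sketch of both diagnostics, assuming per-position logits from the base and tuned models on identical inputs (the paper's exact aggregation may differ):

```python
import torch
import torch.nn.functional as F

def token_kl(base_logits: torch.Tensor, tuned_logits: torch.Tensor) -> torch.Tensor:
    """Per-position KL(tuned || base) over the vocabulary.
    Logits have shape (seq_len, vocab_size); small values mean the tuned
    model stays close to the base distribution at that position."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    tuned_logp = F.log_softmax(tuned_logits, dim=-1)
    return (tuned_logp.exp() * (tuned_logp - base_logp)).sum(dim=-1)

def rank_shift(base_logits: torch.Tensor, tuned_logits: torch.Tensor,
               token_ids: torch.Tensor) -> torch.Tensor:
    """Change in each realized token's rank after tuning
    (positive = the token was demoted by fine-tuning)."""
    def ranks(logits: torch.Tensor) -> torch.Tensor:
        # argsort-of-argsort yields each token's rank (0 = most probable)
        order = logits.argsort(dim=-1, descending=True).argsort(dim=-1)
        return order.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return ranks(tuned_logits) - ranks(base_logits)
```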
4. Numerical Results
- In controlled studies, RL-tuned Qwen3-14B achieves positive TI on both other reasoning (+79.6) and non-reasoning (+29.3) tasks, while SFT-tuned variants show negative TI on non-reasoning tasks (e.g., -41.2, -250.2).
- RL-tuned models outperform SFT-tuned models by substantial margins on non-math benchmarks, even when both are trained on the same math-only data.
Implications
Practical
- RL-based post-training is essential for developing LLMs that retain general-domain capabilities while improving on specialized reasoning tasks. This has direct implications for the design of LLM training pipelines in both academic and industrial settings.
- SFT on narrow, static datasets can lead to catastrophic forgetting, undermining the utility of LLMs in real-world, multi-domain applications.
- Token-level and latent-space diagnostics should be standard practice for evaluating the impact of fine-tuning on model generalization.
Theoretical
- The findings challenge the assumption that improvements in mathematical reasoning automatically translate to broader cognitive abilities in LLMs.
- The results support the hypothesis that on-policy RL updates reinforce desired skills without disrupting general-domain representations, while off-policy SFT can induce representation collapse.
- The paper provides empirical evidence for the importance of optimization dynamics—beyond data and architecture—in shaping the functional capacity of LLMs.
Future Directions
- Scaling RL-based fine-tuning to larger models and more diverse reasoning domains, including multimodal and embodied tasks.
- Developing hybrid or curriculum-based fine-tuning strategies that combine the stability of RL with the efficiency of SFT.
- Investigating the interplay between pre-training data diversity, model size, and fine-tuning paradigm in determining cross-domain generalization.
- Extending latent-space and token-level analyses to other forms of post-training, such as direct preference optimization and process supervision.
Conclusion
This work provides a rigorous, multi-faceted analysis of the transferability of mathematical reasoning in LLMs, demonstrating that RL-based fine-tuning is critical for preserving and enhancing general-domain capabilities. The results have immediate implications for the development and deployment of LLMs in settings where both specialized reasoning and broad competence are required. The diagnostic framework established here sets a new standard for evaluating the impact of fine-tuning paradigms on LLM generalization.