- The paper demonstrates that multi-domain RLVR enhances in-domain performance, achieving up to 99.71% on puzzle benchmarks while balancing trade-offs in math and code.
- The paper employs GRPO and curriculum learning to systematically analyze cross-domain effects, revealing that reward design and template consistency are critical for effective transfer.
- The paper identifies challenges such as negative transfer and language sensitivity, underscoring the need for adaptive reward functions and data-mixing strategies.
Data-Centric Analysis of Multi-Domain Reasoning in RLVR for LLMs
This paper presents a comprehensive empirical study of multi-domain reasoning in LLMs under the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm. The authors systematically investigate the interplay between mathematical reasoning, code generation, and logical puzzle solving, focusing on how domain-specific and cross-domain data, reward design, curriculum learning, template alignment, and language affect both in-domain and out-of-domain generalization. The study uses the Qwen-2.5-7B model family and the Group Relative Policy Optimization (GRPO) algorithm, providing detailed quantitative and qualitative insights into the mechanisms governing multi-domain RL for LLMs.
Experimental Design and Methodology
The experimental setup is rigorous and controlled, with careful curation of datasets for each domain (Math: DeepScaleR, CountDown; Code: CodeR1-12k; Puzzle: Knights-and-Knaves, Logic Puzzle Baron). Data scales are normalized to ensure comparability. RL training is performed using GRPO, which eschews a value model in favor of group-based advantage estimation, and all experiments are conducted on a consistent hardware cluster. Evaluation spans a suite of established benchmarks (MATH500, AIME24, CountDown, HumanEval, MBPP, KK, ZebraLogicBench), with strict 0-shot or 3-shot settings and explicit template control to ensure reproducibility.
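To make the training setup concrete, below is a minimal sketch (not the authors' code) of the group-relative advantage computation that lets GRPO dispense with a learned value model: each sampled response is scored against the mean and standard deviation of its own rollout group. Function and variable names, and the example rewards, are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward by its
    group's mean and std, so no separate value network is required."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 8 rollouts for one prompt; the verifiable reward is 1.0 if the
# final answer passes the checker, else 0.0.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for passing rollouts, negative otherwise
```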
Key Empirical Findings
Single-Domain RLVR
- Math RLVR: Substantial in-domain gains (e.g., Base-DSR: +19.6% on MATH500, +75.56% on CountDown) are observed. However, math training degrades code performance (e.g., Base-CD: 29.59% vs. baseline 67.46%), indicating negative transfer between math and code domains.
- Code RLVR: Yields strong in-domain improvements (Base-CodeR1: +10.37% on HumanEval), but cross-domain effects are model-dependent. Instruct models benefit on OOD tasks, while base models often experience performance drops, attributed to the output-format rigidity induced by code data.
- Puzzle RLVR: Dramatically boosts in-domain logical reasoning (Base-KK: 94.29% on KK), with positive transfer to math but inconsistent or negative effects on code.
Cross-Domain and Multi-Domain RLVR
- Dual-Domain Training: Certain combinations (e.g., Math+Puzzle) yield synergistic improvements in math and puzzle, but can harm code performance. Puzzle+Code achieves the best overall dual-domain balance (+19.39% over baseline).
- Triple-Domain Training: Incorporating all three domains further increases overall average performance (56.57%), but introduces negative transfer for highly specialized tasks (e.g., puzzle accuracy drops compared to Puzzle+Code). The triple-domain setup achieves the most balanced performance profile, mitigating catastrophic forgetting and extreme drops in any single domain.
Data and Training Factors
- Template Consistency: Mismatched training and evaluation templates cause severe performance degradation across all domains, especially for complex tasks. Matched templates (R1-style) consistently yield optimal results (see the template sketch after this list).
- Curriculum Learning: Staged training from easy to hard, with periodic policy refresh, raises the performance upper bound (KK: up to 99.71%), accelerates convergence, and improves stability (a staged-training sketch follows this list).
- Reward Design: Binary rewards are optimal for tasks with low solution sparsity (KK), but fail for harder, sparse-reward tasks (LPB), where partial or rescaled rewards are necessary. However, current partial rewards lack granularity, as they operate at the response level rather than the cell level (see the reward sketch after this list).
- Language Sensitivity: RLVR models trained in Chinese underperform their English-trained counterparts, even with strict language enforcement in reward functions, highlighting persistent cross-lingual generalization gaps.
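To illustrate the template-consistency point, here is a simplified sketch, assuming an R1-style prompt that wraps reasoning in <think> tags and the final answer in <answer> tags. The template strings below are an approximation for illustration, not the paper's verbatim prompts.

```python
# Illustrative approximation of an R1-style template: the model is asked to
# reason inside <think>...</think> and answer inside <answer>...</answer>.
R1_STYLE_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "about the reasoning process and then provides the answer. The reasoning "
    "and answer are enclosed in <think></think> and <answer></answer> tags.\n"
    "User: {question}\nAssistant:"
)

PLAIN_TEMPLATE = "Question: {question}\nAnswer:"  # a mismatched evaluation prompt

def build_prompt(question: str, match_training_template: bool = True) -> str:
    """If the evaluation prompt deviates from the training template, the
    policy's learned output format (and accuracy with it) tends to degrade."""
    template = R1_STYLE_TEMPLATE if match_training_template else PLAIN_TEMPLATE
    return template.format(question=question)
```

A compact sketch of how the staged curriculum could be organized follows, assuming "periodic policy refresh" means re-anchoring the reference model to the latest policy at stage boundaries (our reading, not a detail confirmed above); all function names are hypothetical.

```python
from typing import Callable, List, Tuple

def curriculum_train(
    policy,
    stages: List[Tuple[str, list]],   # e.g. [("easy", easy_data), ("hard", hard_data)]
    train_stage: Callable,            # runs GRPO on one difficulty bucket
    refresh_reference: Callable,      # re-anchors the KL reference to the current policy
):
    """Train on progressively harder data, refreshing the reference policy
    between stages (assumed interpretation of 'periodic policy refresh')."""
    for difficulty, dataset in stages:
        policy = train_stage(policy, dataset)
        refresh_reference(policy)
    return policy
```

Finally, the reward-design contrast can be made concrete with a small sketch, assuming puzzle solutions are representable as entity-to-role assignments. `binary_reward` and `partial_reward` are illustrative names, and the partial credit shown operates at the response level, matching the granularity limitation noted above.

```python
def binary_reward(pred: dict, gold: dict) -> float:
    """All-or-nothing credit: effective when correct solutions are not too sparse (e.g., KK)."""
    return 1.0 if pred == gold else 0.0

def partial_reward(pred: dict, gold: dict) -> float:
    """Response-level partial credit: fraction of entities assigned correctly.
    Still coarser than a true cell-level signal."""
    if not gold:
        return 0.0
    correct = sum(1 for entity, role in gold.items() if pred.get(entity) == role)
    return correct / len(gold)

# Example: four entities, two assigned correctly.
gold = {"Alice": "knight", "Bob": "knave", "Cara": "knight", "Dan": "knave"}
pred = {"Alice": "knight", "Bob": "knight", "Cara": "knight", "Dan": "knight"}
print(binary_reward(pred, gold))   # 0.0
print(partial_reward(pred, gold))  # 0.5
```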
Quantitative Highlights
- Math RLVR: Base-DSR improves MATH500 from 56.40% to 76.00%; CountDown from 1.05% to 76.61%.
- Code RLVR: Base-CodeR1 increases HumanEval from 70.12% to 80.49%; Instruct-CodeR1 reaches 84.15%.
- Puzzle RLVR: Base-KK achieves 94.29% on KK; Instruct-KK reaches 99.14%.
- Triple-Domain RLVR: Overall average performance reaches 56.57%, the highest among all configurations.
Contradictory and Notable Claims
- Contradictory Cross-Domain Effects: Code RLVR enhances OOD performance for instruct models but degrades it for base models, challenging the assumption of uniform transferability.
- Template Robustness: The paper demonstrates that RLVR generalization is not robust to template variation, contradicting claims of template-agnostic reasoning in some prior works.
- Reward Universality: No single reward scheme is optimal across all tasks; binary rewards can cause training collapse in sparse-reward settings, while partial rewards introduce noise in easier tasks.
Theoretical and Practical Implications
The findings have several implications for the design and deployment of RLVR-based LLMs:
- Domain Interactions: Multi-domain RLVR can yield both positive and negative transfer. Careful selection and balancing of training data are required to avoid catastrophic forgetting and to maximize generalization.
- Specialization vs. Generalization: Highly specialized domains (e.g., puzzles) may suffer from negative transfer when mixed with unrelated data, suggesting the need for adaptive data mixing or modular training strategies.
- Reward Engineering: Task-specific, fine-grained reward functions are essential, especially for complex, multi-entity reasoning tasks. Future work should explore cell-level or step-level reward signals.
- Template and Language Control: Strict alignment of training and evaluation templates, as well as language, is critical for reliable deployment. This has direct implications for real-world applications where prompt formats and languages may vary.
Future Directions
The paper suggests several avenues for further research:
- Finer-Grained Domain Taxonomy: Expanding beyond Math, Code, and Puzzle to include science and general reasoning domains could yield more nuanced insights into data-centric RLVR.
- Model Generality: Extending experiments to other architectures (e.g., Llama, DeepSeek) will clarify the generality of observed phenomena.
- Reward Function Innovation: Developing more granular, interpretable, and adaptive reward mechanisms is necessary for robust multi-domain reasoning.
- Cross-Lingual RLVR: Addressing the persistent performance gap in non-English RLVR remains an open challenge, with implications for global deployment.
Conclusion
This work provides a detailed, data-centric analysis of multi-domain RLVR for LLMs, revealing the nuanced dynamics of domain interaction, the criticality of template and reward design, and the challenges of cross-lingual generalization. The empirical results and methodological recommendations offer a foundation for optimizing RLVR pipelines to achieve robust, generalizable, and balanced reasoning capabilities in LLMs. The codebase is available at https://github.com/Leey21/A-Data-Centric-Study.