- The paper demonstrates that multi-domain RLVR enhances in-domain performance, achieving up to 99.71% on puzzle benchmarks while balancing trade-offs in math and code.
- The paper employs GRPO and curriculum learning to systematically analyze cross-domain effects, revealing that reward design and template consistency are critical for effective transfer.
- The paper identifies challenges such as negative transfer and language sensitivity, underscoring the need for adaptive reward functions and data-mixing strategies.
Data-Centric Analysis of Multi-Domain Reasoning in RLVR for LLMs
This paper presents a comprehensive empirical study of multi-domain reasoning in LLMs under the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm. The authors systematically investigate the interplay between mathematical reasoning, code generation, and logical puzzle solving, focusing on how domain-specific and cross-domain data, reward design, curriculum learning, template alignment, and language affect both in-domain and out-of-domain generalization. The study uses the Qwen-2.5-7B model family and the Group Relative Policy Optimization (GRPO) algorithm, providing detailed quantitative and qualitative insights into the mechanisms governing multi-domain RL for LLMs.
Experimental Design and Methodology
The experimental setup is rigorous and controlled, with careful curation of datasets for each domain (Math: DeepScaleR, CountDown; Code: CodeR1-12k; Puzzle: Knights-and-Knaves, Logic Puzzle Baron). Data scales are normalized to ensure comparability. RL training is performed using GRPO, which eschews a value model in favor of group-based advantage estimation, and all experiments are conducted on a consistent hardware cluster. Evaluation spans a suite of established benchmarks (MATH500, AIME24, CountDown, HumanEval, MBPP, KK, ZebraLogicBench), with strict 0-shot or 3-shot settings and explicit template control to ensure reproducibility.
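To make the training setup concrete, below is a minimal sketch (not the authors' code) of the group-relative advantage computation that lets GRPO dispense with a learned value model: each sampled response is scored against the mean and standard deviation of its own rollout group. Function and variable names, and the example rewards, are illustrative.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: normalize each rollout's reward by its
    group's mean and std, so no separate value network is required."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + eps)

# Example: 8 rollouts for one prompt; the verifiable reward is 1.0 if the
# final answer passes the checker, else 0.0.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for passing rollouts, negative otherwise
```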
Key Empirical Findings
Single-Domain RLVR
- Math RLVR: Substantial in-domain gains (e.g., Base-DSR: +19.6% on MATH500, +75.56% on CountDown) are observed. However, math training degrades code performance (e.g., Base-CD: 29.59% vs. baseline 67.46%), indicating negative transfer between math and code domains.
- Code RLVR: Yields strong in-domain improvements (Base-CodeR1: +10.37% on HumanEval), but cross-domain effects are model-dependent. Instruct models benefit on OOD tasks, while base models often experience performance drops, attributed to the output-format rigidity induced by code data.
- Puzzle RLVR: Dramatically boosts in-domain logical reasoning (Base-KK: 94.29% on KK), with positive transfer to math but inconsistent or negative effects on code.
Cross-Domain and Multi-Domain RLVR
- Dual-Domain Training: Certain combinations (e.g., Math+Puzzle) yield synergistic improvements in math and puzzle, but can harm code performance. Puzzle+Code achieves the best overall dual-domain balance (+19.39% over baseline).
- Triple-Domain Training: Incorporating all three domains further increases overall average performance (56.57%), but introduces negative transfer for highly specialized tasks (e.g., puzzle accuracy drops compared to Puzzle+Code). The triple-domain setup achieves the most balanced performance profile, mitigating catastrophic forgetting and extreme drops in any single domain.
Data and Training Factors
- Template Consistency: Mismatched training and evaluation templates cause severe performance degradation across all domains, especially for complex tasks. Matched templates (R1-style) consistently yield optimal results (see the template sketch after this list).
- Curriculum Learning: Staged training from easy to hard, with periodic policy refresh, raises the performance upper bound (KK: up to 99.71%), accelerates convergence, and improves stability (a staged-training sketch follows this list).
- Reward Design: Binary rewards are optimal for tasks with low solution sparsity (KK), but fail for harder, sparse-reward tasks (LPB), where partial or rescaled rewards are necessary. However, current partial rewards lack granularity, as they operate at the response level rather than the cell level (see the reward sketch after this list).
- Language Sensitivity: RLVR models trained in Chinese underperform their English-trained counterparts, even with strict language enforcement in reward functions, highlighting persistent cross-lingual generalization gaps.
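To illustrate the template-consistency point, here is a simplified sketch, assuming an R1-style prompt that wraps reasoning in <think> tags and the final answer in <answer> tags. The template strings below are an approximation for illustration, not the paper's verbatim prompts.

```python
# Illustrative approximation of an R1-style template: the model is asked to
# reason inside <think>...</think> and answer inside <answer>...</answer>.
R1_STYLE_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "about the reasoning process and then provides the answer. The reasoning "
    "and answer are enclosed in <think></think> and <answer></answer> tags.\n"
    "User: {question}\nAssistant:"
)

PLAIN_TEMPLATE = "Question: {question}\nAnswer:"  # a mismatched evaluation prompt

def build_prompt(question: str, match_training_template: bool = True) -> str:
    """If the evaluation prompt deviates from the training template, the
    policy's learned output format (and accuracy with it) tends to degrade."""
    template = R1_STYLE_TEMPLATE if match_training_template else PLAIN_TEMPLATE
    return template.format(question=question)
```

A compact sketch of how the staged curriculum could be organized follows, assuming "periodic policy refresh" means re-anchoring the reference model to the latest policy at stage boundaries (our reading, not a detail confirmed above); all function names are hypothetical.

```python
from typing import Callable, List, Tuple

def curriculum_train(
    policy,
    stages: List[Tuple[str, list]],   # e.g. [("easy", easy_data), ("hard", hard_data)]
    train_stage: Callable,            # runs GRPO on one difficulty bucket
    refresh_reference: Callable,      # re-anchors the KL reference to the current policy
):
    """Train on progressively harder data, refreshing the reference policy
    between stages (assumed interpretation of 'periodic policy refresh')."""
    for difficulty, dataset in stages:
        policy = train_stage(policy, dataset)
        refresh_reference(policy)
    return policy
```

Finally, the reward-design contrast can be made concrete with a small sketch, assuming puzzle solutions are representable as entity-to-role assignments. `binary_reward` and `partial_reward` are illustrative names, and the partial credit shown operates at the response level, matching the granularity limitation noted above.

```python
def binary_reward(pred: dict, gold: dict) -> float:
    """All-or-nothing credit: effective when correct solutions are not too sparse (e.g., KK)."""
    return 1.0 if pred == gold else 0.0

def partial_reward(pred: dict, gold: dict) -> float:
    """Response-level partial credit: fraction of entities assigned correctly.
    Still coarser than a true cell-level signal."""
    if not gold:
        return 0.0
    correct = sum(1 for entity, role in gold.items() if pred.get(entity) == role)
    return correct / len(gold)

# Example: four entities, two assigned correctly.
gold = {"Alice": "knight", "Bob": "knave", "Cara": "knight", "Dan": "knave"}
pred = {"Alice": "knight", "Bob": "knight", "Cara": "knight", "Dan": "knight"}
print(binary_reward(pred, gold))   # 0.0
print(partial_reward(pred, gold))  # 0.5
```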
Quantitative Highlights
- Math RLVR: Base-DSR improves MATH500 from 56.40% to 76.00%; CountDown from 1.05% to 76.61%.
- Code RLVR: Base-CodeR1 increases HumanEval from 70.12% to 80.49%; Instruct-CodeR1 reaches 84.15%.
- Puzzle RLVR: Base-KK achieves 94.29% on KK; Instruct-KK reaches 99.14%.
- Triple-Domain RLVR: Overall average performance reaches 56.57%, the highest among all configurations.
Contradictory and Notable Claims
- Contradictory Cross-Domain Effects: Code RLVR enhances OOD performance for instruct models but degrades it for base models, challenging the assumption of uniform transferability.
- Template Robustness: The paper demonstrates that RLVR generalization is not robust to template variation, contradicting claims of template-agnostic reasoning in some prior works.
- Reward Universality: No single reward scheme is optimal across all tasks; binary rewards can cause training collapse in sparse-reward settings, while partial rewards introduce noise in easier tasks.
Theoretical and Practical Implications
The findings have several implications for the design and deployment of RLVR-based LLMs:
- Domain Interactions: Multi-domain RLVR can yield both positive and negative transfer. Careful selection and balancing of training data are required to avoid catastrophic forgetting and to maximize generalization.
- Specialization vs. Generalization: Highly specialized domains (e.g., puzzles) may suffer from negative transfer when mixed with unrelated data, suggesting the need for adaptive data mixing or modular training strategies.
- Reward Engineering: Task-specific, fine-grained reward functions are essential, especially for complex, multi-entity reasoning tasks. Future work should explore cell-level or step-level reward signals.
- Template and Language Control: Strict alignment of training and evaluation templates, as well as language, is critical for reliable deployment. This has direct implications for real-world applications where prompt formats and languages may vary.
Future Directions
The paper suggests several avenues for further research:
- Finer-Grained Domain Taxonomy: Expanding beyond Math, Code, and Puzzle to include science and general reasoning domains could yield more nuanced insights into data-centric RLVR.
- Model Generality: Extending experiments to other architectures (e.g., Llama, DeepSeek) will clarify the generality of observed phenomena.
- Reward Function Innovation: Developing more granular, interpretable, and adaptive reward mechanisms is necessary for robust multi-domain reasoning.
- Cross-Lingual RLVR: Addressing the persistent performance gap in non-English RLVR remains an open challenge, with implications for global deployment.
Conclusion
This work provides a detailed, data-centric analysis of multi-domain RLVR for LLMs, revealing the nuanced dynamics of domain interaction, the criticality of template and reward design, and the challenges of cross-lingual generalization. The empirical results and methodological recommendations offer a foundation for optimizing RLVR pipelines to achieve robust, generalizable, and balanced reasoning capabilities in LLMs. The codebase is available at https://github.com/Leey21/A-Data-Centric-Study.