
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning (2507.17512v1)

Published 23 Jul 2025 in cs.AI and cs.LG

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models' in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions including mutual enhancements and conflicts that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.

Summary

  • The paper demonstrates that targeted RLVR training achieves significant in-domain improvements, such as a 19.60% gain in math reasoning accuracy.
  • It reveals non-monotonic cross-domain transfer effects, where combining domains can boost some tasks while degrading others, notably in code performance.
  • Methodological insights stress the critical role of strict template alignment, tailored reward design, and curriculum learning in optimizing LLM reasoning.

Data-Centric Analysis of Multi-Domain Reasoning in RLVR for LLMs

This paper presents a comprehensive empirical study of multi-domain reasoning in LLMs under the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm, focusing on the interplay between mathematical reasoning, code generation, and logical puzzle solving. The work systematically investigates how domain-specific and cross-domain data, reward design, curriculum learning, template alignment, and language choice affect both in-domain and out-of-domain generalization, using the Qwen-2.5-7B model family and the Group Relative Policy Optimization (GRPO) algorithm.

Experimental Design and Methodology

The paper employs a rigorous experimental protocol, curating balanced datasets for each domain: DeepScaleR and CountDown for math, CodeR1-12k for code, and Knights-and-Knaves (KK) and Logic Puzzle Baron (LPB) for puzzles. All experiments are conducted with standardized prompt templates (R1-style) and evaluated on established benchmarks (MATH500, AIME24, CountDown, HumanEval, MBPP, KK, ZebraLogicBench) using OpenCompass. The RL optimization is performed with GRPO, which eschews a value model in favor of group-based advantage estimation, and all training is executed on 8×A100 GPUs.
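
To make the group-based advantage estimation concrete, the minimal sketch below shows the core GRPO computation: each sampled response in a prompt's group is scored by a verifier, and its advantage is the group-normalized reward. The code is illustrative Python, not the paper's implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of sampled responses.

    Instead of a learned value model, each response's advantage is its
    verifier reward normalized by the group's mean and standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled completions for one math prompt, scored 1 if the
# verifier accepts the final answer and 0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[ 1. -1. -1.  1.]
```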

Key Empirical Findings

Single-Domain RLVR

  • Math RLVR: Substantial in-domain gains (e.g., Base-DSR improves MATH500 accuracy from 56.40% to 76.00%), with positive transfer to puzzle tasks but notable degradation in code performance (e.g., Base-CD drops code average from 67.46% to 29.59%).
  • Code RLVR: Strong in-domain improvements (Base model HumanEval: 70.12%→80.49%), but cross-domain effects are model-dependent: instruct models benefit in OOD tasks, while base models often experience performance drops, attributed to output format rigidity.
  • Puzzle RLVR: Targeted training yields near-saturation in-domain (KK: 94.29% base, 99.14% instruct), with positive transfer to math tasks but inconsistent or negative effects on code.

Cross-Domain and Multi-Domain RLVR

  • Dual-domain training (e.g., Math+Puzzle, Puzzle+Code) can yield synergistic improvements, but adding domains does not guarantee monotonic gains. For instance, Math+Puzzle improves math and puzzle averages but degrades code performance, highlighting domain conflict.
  • Triple-domain training (Math+Code+Puzzle) achieves the highest overall average (56.57%), with improved task balance and generalization, but at the cost of reduced peak performance in highly specialized tasks (e.g., puzzle accuracy drops compared to Puzzle+Code).
  • Catastrophic forgetting is observed in some cross-domain settings, especially when domain-specific data distributions are not carefully balanced.
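
One practical knob behind these effects is how the domain corpora are mixed during combined training. The toy sampler below sketches fixed-weight mixing across the three domains; the weights and dataset placeholders are assumptions for illustration, not values from the paper.

```python
import random

def mixed_domain_sampler(datasets, weights, num_samples, seed=0):
    """Yield (domain, example) pairs drawn with fixed per-domain mixing weights.

    `datasets` maps a domain name to a list of examples; `weights` maps the same
    names to relative sampling probabilities. Keeping this mixture balanced is
    one simple lever against the domain conflict and forgetting described above.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[name] for name in names]
    for _ in range(num_samples):
        domain = rng.choices(names, weights=probs, k=1)[0]
        yield domain, rng.choice(datasets[domain])

# Toy placeholders standing in for the math / code / puzzle corpora.
toy = {"math": ["m1", "m2"], "code": ["c1", "c2"], "puzzle": ["p1", "p2"]}
for domain, example in mixed_domain_sampler(toy, {"math": 1, "code": 1, "puzzle": 1}, 5):
    print(domain, example)
```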

Supervised Fine-Tuning (SFT) and RLVR

  • SFT prior to RL consistently boosts RLVR effectiveness, with instruct models outperforming base models across all domains and exhibiting greater robustness to cross-domain transfer.

Template Consistency

  • Template mismatch between training and evaluation leads to severe performance degradation (e.g., instruct model MATH500: 73.20% with R1 template vs. 1.80% with Qwen template), underscoring the necessity of strict template alignment in RLVR pipelines.
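
For concreteness, the sketch below shows what template alignment means in practice: the same R1-style wrapper (reasoning in `<think>` tags, answer in `<answer>` tags) must be applied at both training and evaluation time. The exact template wording is an assumption, not copied from the paper.

```python
# Illustrative R1-style template; the key point is that the SAME wrapper must
# be used for RL training and for benchmark evaluation.
R1_STYLE_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "through the reasoning process and then gives the final answer. The "
    "reasoning and the answer are enclosed in <think> </think> and "
    "<answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

def build_prompt(question: str, template: str = R1_STYLE_TEMPLATE) -> str:
    """Wrap a raw question in the training-time template.

    Evaluating with a different wrapper (e.g. the model's native chat format)
    is exactly the mismatch that collapses MATH500 accuracy in the paper.
    """
    return template.format(question=question)

print(build_prompt("What is 7 * 8?"))
```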

Curriculum Learning

  • Curriculum learning, stratified by task difficulty, raises the upper bound of model performance (KK: 94.29%→99.71%), and a policy refresh strategy (periodic reference model and optimizer reset) further accelerates convergence and stabilizes training.
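
The following framework-agnostic sketch shows the control flow of difficulty-ordered curriculum training with a policy refresh step; the callables and toy objects are placeholders, not the paper's training code.

```python
import copy

def train_with_curriculum(policy, make_optimizer, stages, rl_step, refresh_every=1):
    """Difficulty-ordered curriculum with periodic 'policy refresh'.

    `stages` is a list of datasets ordered from easiest to hardest, and
    `rl_step(policy, reference, optimizer, batch)` performs one RL update
    (e.g. a GRPO step against the frozen KL reference). After every
    `refresh_every` stages the reference is re-anchored to the current policy
    and the optimizer state is reset.
    """
    reference = copy.deepcopy(policy)
    optimizer = make_optimizer(policy)
    for stage_idx, dataset in enumerate(stages):
        for batch in dataset:
            rl_step(policy, reference, optimizer, batch)
        if (stage_idx + 1) % refresh_every == 0:
            reference = copy.deepcopy(policy)   # policy refresh: new KL anchor
            optimizer = make_optimizer(policy)  # and a fresh optimizer state
    return policy

# Toy usage with stand-in objects, just to show the control flow.
policy = {"updates": 0}
train_with_curriculum(
    policy,
    make_optimizer=lambda p: {"state": {}},
    stages=[[1, 2], [3, 4], [5, 6]],  # three difficulty tiers of toy "batches"
    rl_step=lambda p, ref, opt, batch: p.update(updates=p["updates"] + 1),
)
print(policy)  # {'updates': 6}
```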

Reward Design

  • Reward scheme efficacy is highly dataset-dependent:
    • Binary rewards are optimal for tasks with low solution sparsity (KK), but fail on harder, sparse-reward tasks (LPB).
    • Partial and rescaled rewards are necessary for complex, multi-entity tasks, but current response-level partial rewards lack granularity, limiting their effectiveness.
  • No universal reward strategy exists; reward design must be tailored to task structure and difficulty (the two schemes are sketched below).
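
A minimal sketch of the binary versus partial reward schemes, using a hypothetical entity-assignment puzzle; the data structures are illustrative, not the paper's actual verifier.

```python
def binary_reward(pred, gold):
    """All-or-nothing credit: effective when correct solutions are not too
    sparse (e.g. Knights-and-Knaves), but gives almost no learning signal on
    harder, sparse-reward puzzles."""
    return 1.0 if pred == gold else 0.0

def partial_reward(pred, gold):
    """Response-level partial credit: fraction of entities assigned correctly.
    This is the coarse kind of partial reward the paper finds insufficient for
    complex multi-entity tasks, motivating finer-grained (e.g. cell-level) schemes."""
    if not gold:
        return 0.0
    correct = sum(pred.get(entity) == value for entity, value in gold.items())
    return correct / len(gold)

# Hypothetical entity -> role assignments, not taken from the paper's data.
gold = {"Alice": "knight", "Bob": "knave", "Carol": "knight"}
pred = {"Alice": "knight", "Bob": "knight", "Carol": "knight"}
print(binary_reward(pred, gold), partial_reward(pred, gold))  # 0.0 0.666...
```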

Language Sensitivity

  • RLVR models trained in Chinese consistently underperform compared to English, even with strict language enforcement in reward functions, indicating a persistent cross-lingual gap in reasoning capabilities.
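
As a rough illustration of strict language enforcement in the reward function, the sketch below gates the verifiable task reward on a crude character-based language check; the heuristic and threshold are assumptions, not the paper's exact mechanism.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

def language_gate(response: str, target_lang: str = "zh") -> float:
    """Crude character-based check: 1.0 if the response is (mostly) in the
    target language, else 0.0. Only a sketch; the paper's actual enforcement
    mechanism may differ."""
    chars = [c for c in response if not c.isspace()]
    if not chars:
        return 0.0
    cjk_ratio = sum(bool(CJK_CHAR.match(c)) for c in chars) / len(chars)
    in_target = cjk_ratio > 0.5 if target_lang == "zh" else cjk_ratio < 0.1
    return 1.0 if in_target else 0.0

def enforced_reward(task_reward: float, response: str, target_lang: str = "zh") -> float:
    """Multiply the verifiable task reward by the language gate."""
    return task_reward * language_gate(response, target_lang)

print(enforced_reward(1.0, "答案是四十二。", target_lang="zh"))     # 1.0: correct, in Chinese
print(enforced_reward(1.0, "The answer is 42.", target_lang="zh"))  # 0.0: correct, wrong language
```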

Numerical Highlights

  • Base-DSR (math RLVR): MATH500 +19.60%, CountDown +75.56% over base.
  • Base-CodeR1 (code RLVR): HumanEval +10.37%, but MATH500 −5.60%.
  • Puzzle RLVR: KK accuracy up to 99.14% (instruct), with math OOD transfer (MATH500: 73.20%).
  • Triple-domain RLVR: All-task average 56.57%, highest among all configurations.
  • Template mismatch: Instruct model MATH500 drops from 73.20% (R1) to 1.80% (Qwen).
  • Curriculum learning with policy refresh: KK accuracy 99.71% (vs. 94.29% mixed).

Implications and Theoretical Considerations

The results demonstrate that domain interactions in RLVR are nontrivial and often non-monotonic. While certain domain pairs (e.g., math and puzzle) exhibit mutual enhancement, others (e.g., code with base models) can induce negative transfer or catastrophic forgetting. The findings challenge the assumption that more data diversity always yields better generalization, highlighting the need for careful data curation, reward engineering, and curriculum design.

The critical role of template alignment exposes a significant vulnerability in current RLVR pipelines, with practical implications for reproducibility and deployment. The language sensitivity of RLVR-trained models further suggests that cross-lingual generalization remains an open challenge, likely requiring architectural or algorithmic advances beyond data scaling.

The paper also exposes the limitations of current reward mechanisms, particularly the inadequacy of response-level partial rewards for complex, multi-entity tasks. This points to the necessity of developing finer-grained, structure-aware reward functions and possibly integrating external verifiers or symbolic reasoning modules.

Future Directions

  • Finer-grained reward modeling: Cell-level or step-level rewards for complex reasoning tasks.
  • Expanded domain coverage: Inclusion of science, general reasoning, and other cognitive domains to further probe domain interactions.
  • Cross-lingual RLVR: Algorithmic innovations to close the performance gap in non-English reasoning.
  • Robust template handling: Methods for template-agnostic RLVR or automatic template adaptation.
  • Scaling to larger models and alternative architectures: Extending analysis to Llama, DeepSeek, and other LLMs.

Conclusion

This work provides a detailed, data-centric analysis of multi-domain reasoning in LLMs under RLVR, revealing nuanced patterns of domain interaction, the importance of SFT, the criticality of template and reward design, and the persistent challenges in cross-lingual and cross-domain generalization. The empirical findings offer actionable guidelines for practitioners and highlight several open research directions for advancing robust, generalizable reasoning in LLMs.
