Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective (2506.14965v1)

Published 17 Jun 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement learning (RL) has emerged as a promising approach to improve LLM reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360

Summary

  • The paper demonstrates how RL improves reasoning capabilities, especially in domains underrepresented in pretraining such as Logic, Simulation, and Tabular.
  • It introduces Guru-7B and Guru-32B models that outperform baselines by 7.9% and 6.7% across varied evaluation tasks.
  • The study highlights the importance of domain-specific reward schemes and diverse task difficulties to fine-tune LLM reasoning.

Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

The paper "Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective" addresses the integration of reinforcement learning (RL) with LLMs to enhance reasoning capabilities across various domains. The paper focuses on the creation and utilization of a new dataset, Guru, which comprises 92K examples across six distinct reasoning domains: Math, Code, Science, Logic, Simulation, and Tabular. Each domain-specific sub-corpus within the dataset is constructed with careful considerations for data deduplication, domain-specific reward schemas, and filtering processes to ensure high-quality training data for reinforcement learning (RL).

A critical finding of the paper is that RL's effectiveness for reasoning is domain-dependent. In domains well covered during pretraining, such as Math, Code, and Science, RL chiefly extracts and refines knowledge the base model already possesses, whereas in domains with little pretraining exposure, such as Logic, Simulation, and Tabular, RL appears to drive genuine skill acquisition. This nuanced picture challenges the prevailing notion that RL primarily serves to elicit knowledge already latent in pretrained LLMs.

The authors introduce Guru-7B and Guru-32B, which achieve state-of-the-art reasoning performance among open models RL-trained on publicly available data. The two models outperform the best baselines by 7.9% and 6.7%, respectively, on a 17-task evaluation suite spanning the six reasoning domains, and they are especially strong on tasks unlikely to have appeared in pretraining data, underscoring RL's role in acquiring new skills rather than merely refining existing knowledge.
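The abstract additionally reports improved Pass@k over the base models, particularly on complex tasks unlikely to appear in pretraining data. For reference, Pass@k is commonly computed with the standard unbiased estimator sketched below; this is the conventional formula, not necessarily the exact evaluation script used in the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: the probability that at least one of k
    samples, drawn from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 4 of them correct, evaluated at k = 8
print(pass_at_k(16, 4, 8))  # ~0.96
```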

Furthermore, the paper's analysis uncovers notable differences in how RL behaves across domains. Domains frequently represented in pretraining corpora show clear gains from cross-domain RL training, whereas domains lacking such exposure require in-domain training to achieve meaningful improvements. This observation has significant implications for designing training regimens and datasets that both elicit latent capabilities in LLMs and cultivate new reasoning abilities.

The paper also examines the dynamics of RL-based reasoning, showing that reward and response behaviors during RL fine-tuning follow different patterns across domains. In addition, mixing tasks of varying difficulty across domains complicates cross-domain transfer, so the training mixture must be calibrated carefully to avoid biasing the model toward or against particular domains.
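The exact filtering criteria are not spelled out in this summary, but the difficulty- and domain-aware calibration described above can be sketched roughly as follows; the pass-rate thresholds, field names, and per-domain cap are illustrative assumptions rather than the paper's settings.

```python
from collections import defaultdict

def filter_by_difficulty(examples, pass_rate, low=0.05, high=0.95):
    """Keep examples whose estimated base-model pass rate lies in (low, high).

    Items that are always solved or never solved provide little learning
    signal under outcome-based rewards; the thresholds are illustrative.
    """
    return [ex for ex in examples if low < pass_rate[ex["id"]] < high]

def balance_domains(examples, per_domain_cap):
    """Cap each domain's contribution so heavily represented domains
    (e.g., Math or Code) do not dominate the cross-domain RL mixture."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["domain"]].append(ex)
    mixture = []
    for items in buckets.values():
        mixture.extend(items[:per_domain_cap])
    return mixture
```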

In light of these insights, the paper argues for a shift toward multi-domain RL research aimed at cultivating broad-spectrum reasoning competence in LLMs. The results and methodology open avenues for incorporating domain-diverse training data into RL pipelines, improving the versatility and depth of reasoning in next-generation models.

Overall, this work advances the understanding of cross-domain generalization and of RL's impact on LLM reasoning, paving the way for more comprehensive and robust reasoning capabilities. The public release of the Guru dataset, models, and training and evaluation code supports continued work on general-purpose reasoning and sets a welcome precedent for transparency and collaboration in the field.
