- The paper introduces the Light-R1 series of smaller models, trained from scratch with a curriculum of SFT, DPO, and RL to achieve strong long chain-of-thought (long-COT) reasoning performance.
- Key models like Light-R1-32B and Light-R1-14B-DS achieved high scores on AIME tasks, surpassing larger models and demonstrating efficiency.
- This research shows that efficient long-COT reasoning is possible with smaller models, opening avenues for applications in computationally constrained environments and guiding future work on training techniques.
The paper "Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond" (2503.10460) introduces the Light-R1 series of models, focusing on training LLMs from scratch to excel in long chain-of-thought (long COT) reasoning. The work emphasizes achieving high reasoning performance with smaller, computationally efficient models.
Overview of Light-R1
The core objective of Light-R1 is to develop models capable of long-COT reasoning without the extensive computational demands typically associated with larger models (e.g., those exceeding 70B parameters). The approach is a curriculum-based training regimen combining Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning via Group Relative Policy Optimization (GRPO). This methodology aims to cultivate robust reasoning capabilities in models that initially lack long-COT aptitude.
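As a rough illustration of the curriculum idea (not the paper's released data pipeline), the sketch below orders a problem pool by a per-problem pass rate measured with a reference model, keeping a broad first-stage set and carving out a small, hardest subset for the second stage. The `pass_rate` field, the 0.9 threshold, and the helper name are hypothetical.

```python
from typing import Dict, List, Tuple

def build_curriculum(problems: List[Dict], stage2_size: int = 3000) -> Tuple[List[Dict], List[Dict]]:
    """Split a problem pool into a broad stage-1 SFT set and a small,
    high-difficulty stage-2 SFT set, using a (hypothetical) pass_rate
    field obtained by sampling a reference model on each problem."""
    # Stage 1: keep everything that is not trivially easy for the reference model.
    stage1 = [p for p in problems if p["pass_rate"] < 0.9]
    # Stage 2: the hardest problems (lowest pass rate) drawn from the stage-1 pool.
    hardest_first = sorted(stage1, key=lambda p: p["pass_rate"])
    stage2 = hardest_first[:stage2_size]
    return stage1, stage2

# Toy example; a real pool would hold tens of thousands of problems with solutions.
pool = [
    {"id": 1, "pass_rate": 0.95},  # too easy, dropped from stage 1
    {"id": 2, "pass_rate": 0.40},
    {"id": 3, "pass_rate": 0.05},  # very hard, lands in stage 2
]
stage1, stage2 = build_curriculum(pool, stage2_size=1)
print([p["id"] for p in stage1], [p["id"] for p in stage2])  # [2, 3] [3]
```

The point of the ordering is simply that easier material trains basic long-COT behavior first, while the hardest examples are reserved for a focused second pass.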
Key results that highlight the efficacy of the Light-R1 series include:
- The Light-R1-32B model, which attained scores of 76.6% on AIME24 and 64.6% on AIME25, surpassing benchmarks previously set by DeepSeek-R1-Distill-Qwen-32B.
- Reinforcement learning yielded a further improvement of roughly 2% in mathematical reasoning performance, demonstrated on the Light-R1-14B-DS model.
- The Light-R1-14B-DS model attained AIME24 and AIME25 scores of 74.0 and 60.2, respectively, outperforming many 32B models and even DeepSeek-R1-Distill-Llama-70B.
Methodology in Detail
The training methodology is structured around a two-stage curriculum:
- Two-Stage SFT: Models are first trained on a 76k-example dataset focused on mathematical reasoning, then fine-tuned on a 3k high-difficulty dataset. This staged progression is crucial for building up reasoning capacity in models that start without inherent long-COT capabilities.
- Semi-On-Policy DPO: Direct Preference Optimization is applied to verified chosen/rejected response pairs constructed from the model's own sampled outputs, refining response quality without the full computational cost of on-policy RL.
- Reinforcement Learning (RL): GRPO is applied after the second fine-tuning stage to push response length and reward higher without degrading benchmark performance; during RL training, response length and reward score increase simultaneously (a simplified sketch of GRPO's group-relative advantages follows this list).
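The GRPO stage can be pictured with a small sketch. The snippet below is an illustrative simplification, not the authors' implementation: it assumes a verifiable exact-match reward and shows only the group-relative advantage computation that replaces a learned value function; the clipped policy-gradient update, KL control, and length handling are omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each response's reward against the
    mean and std of its own sampled group (no separate value network)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def exact_match_reward(model_answer: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer matches the reference."""
    return float(model_answer.strip() == reference_answer.strip())

# Example: one prompt, a group of 4 sampled responses, reference answer "42".
sampled_answers = ["42", "41", "42", "7"]
rewards = [exact_match_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # positive for correct responses, negative for incorrect ones
```

In a full training loop these per-response advantages would weight a clipped policy-gradient objective over the sampled tokens, so responses that score above their group's average are reinforced and the rest are suppressed.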
Implications and Future Research
This research marks a considerable advance toward deploying high-performance reasoning models in computationally constrained environments. Light-R1 models offer efficient long-COT reasoning, which is particularly relevant for applications such as real-time problem-solving, mathematical computation, scientific investigation, and algorithmic planning.
The paper suggests that future work should explore model merging techniques and curriculum designs tailored for response-length optimization and reward stabilization. The application of RL methodology on Light-R1 models provides a clear direction for such explorations.
Future research directions include investigating multi-domain generalization capabilities, extending the models' proficiency beyond mathematics, and dynamically adapting curriculum structures for interdisciplinary knowledge integration. Moreover, leveraging open-source datasets and methods, as demonstrated in the Light-R1 project, could foster collaborative advancements in AI reasoning across both academic and industrial sectors.
In summary, the Light-R1 series represents a significant step forward in balancing computational efficiency and reasoning power in AI models. By using a curriculum-based approach with SFT, DPO, and RL, the Light-R1 models achieve competitive performance in long chain-of-thought reasoning, opening avenues for further research and application in various domains.