- The paper introduces the Light-R1 series of smaller models, trained from scratch with a curriculum of SFT, DPO, and RL to achieve strong long chain-of-thought (long-COT) reasoning performance.
- Key models like Light-R1-32B and Light-R1-14B-DS achieved high scores on AIME tasks, surpassing larger models and demonstrating efficiency.
- This research shows that efficient long-COT reasoning is possible with smaller models, opening avenues for applications in computationally constrained environments and guiding future work on training techniques.
The paper "Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond" (2503.10460) introduces the Light-R1 series of models, focusing on training LLMs from scratch to excel in long chain-of-thought (long COT) reasoning. The work emphasizes achieving high reasoning performance with smaller, computationally efficient models.
Overview of Light-R1
The core objective of Light-R1 is to develop models capable of long-COT reasoning without the extensive computational demands typically associated with larger models (e.g., those exceeding 70B parameters). The approach is a curriculum-based training regimen combining Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning via Group Relative Policy Optimization (GRPO). This methodology aims to cultivate robust reasoning capabilities in models that initially lack long-COT aptitude.
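As a rough illustration of the curriculum idea (not the paper's released data pipeline), the sketch below orders a problem pool by a per-problem pass rate measured with a reference model, keeping a broad first-stage set and carving out a small, hardest subset for the second stage. The `pass_rate` field, the 0.9 threshold, and the helper name are hypothetical.

```python
from typing import Dict, List, Tuple

def build_curriculum(problems: List[Dict], stage2_size: int = 3000) -> Tuple[List[Dict], List[Dict]]:
    """Split a problem pool into a broad stage-1 SFT set and a small,
    high-difficulty stage-2 SFT set, using a (hypothetical) pass_rate
    field obtained by sampling a reference model on each problem."""
    # Stage 1: keep everything that is not trivially easy for the reference model.
    stage1 = [p for p in problems if p["pass_rate"] < 0.9]
    # Stage 2: the hardest problems (lowest pass rate) drawn from the stage-1 pool.
    hardest_first = sorted(stage1, key=lambda p: p["pass_rate"])
    stage2 = hardest_first[:stage2_size]
    return stage1, stage2

# Toy example; a real pool would hold tens of thousands of problems with solutions.
pool = [
    {"id": 1, "pass_rate": 0.95},  # too easy, dropped from stage 1
    {"id": 2, "pass_rate": 0.40},
    {"id": 3, "pass_rate": 0.05},  # very hard, lands in stage 2
]
stage1, stage2 = build_curriculum(pool, stage2_size=1)
print([p["id"] for p in stage1], [p["id"] for p in stage2])  # [2, 3] [3]
```

The point of the ordering is simply that easier material trains basic long-COT behavior first, while the hardest examples are reserved for a focused second pass.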
Key results that highlight the efficacy of the Light-R1 series include:
- The Light-R1-32B model, which attained scores of 76.6% on AIME24 and 64.6% on AIME25, surpassing benchmarks previously set by DeepSeek-R1-Distill-Qwen-32B.
- Reinforcement learning yielded a further improvement of roughly 2% in mathematical reasoning performance, demonstrated on the Light-R1-14B-DS model.
- The Light-R1-14B-DS model attained AIME24 and AIME25 scores of 74.0 and 60.2, respectively, outperforming many 32B models and even DeepSeek-R1-Distill-Llama-70B.
Methodology in Detail
The training methodology is structured around a two-stage curriculum:
- Two-Stage SFT: Models are first trained on a 76k-example dataset focused on mathematical reasoning, then fine-tuned on a 3k high-difficulty dataset. This staged progression is crucial for building up reasoning capacity in models that start without inherent long-COT capabilities.
- Semi-On-Policy DPO: Direct Preference Optimization is applied to verified chosen/rejected response pairs constructed from the model's own sampled outputs, refining response quality without the full computational cost of on-policy RL.
- Reinforcement Learning (RL): GRPO is applied after the second fine-tuning stage to push response length and reward higher without degrading benchmark performance; during RL training, response length and reward score increase simultaneously (a simplified sketch of GRPO's group-relative advantages follows this list).
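The GRPO stage can be pictured with a small sketch. The snippet below is an illustrative simplification, not the authors' implementation: it assumes a verifiable exact-match reward and shows only the group-relative advantage computation that replaces a learned value function; the clipped policy-gradient update, KL control, and length handling are omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each response's reward against the
    mean and std of its own sampled group (no separate value network)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def exact_match_reward(model_answer: str, reference_answer: str) -> float:
    """Toy verifiable reward: 1.0 if the final answer matches the reference."""
    return float(model_answer.strip() == reference_answer.strip())

# Example: one prompt, a group of 4 sampled responses, reference answer "42".
sampled_answers = ["42", "41", "42", "7"]
rewards = [exact_match_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # positive for correct responses, negative for incorrect ones
```

In a full training loop these per-response advantages would weight a clipped policy-gradient objective over the sampled tokens, so responses that score above their group's average are reinforced and the rest are suppressed.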
Implications and Future Research
This research marks a considerable advance toward deploying high-performance reasoning models in computationally constrained environments. Light-R1 models offer efficient long-COT reasoning, which is particularly relevant for applications such as real-time problem-solving, mathematical computation, scientific investigation, and algorithmic planning.
The paper suggests that future work should explore model merging techniques and curriculum designs tailored for response-length optimization and reward stabilization. The application of RL methodology on Light-R1 models provides a clear direction for such explorations.
Future research directions include investigating multi-domain generalization capabilities, extending the models' proficiency beyond mathematics, and dynamically adapting curriculum structures for interdisciplinary knowledge integration. Moreover, leveraging open-source datasets and methods, as demonstrated in the Light-R1 project, could foster collaborative advancements in AI reasoning across both academic and industrial sectors.
In summary, the Light-R1 series represents a significant step forward in balancing computational efficiency and reasoning power in AI models. By using a curriculum-based approach with SFT, DPO, and RL, the Light-R1 models achieve competitive performance in long chain-of-thought reasoning, opening avenues for further research and application in various domains.