Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond (2503.10460v4)
Abstract: This paper introduces Light-R1, an open-source suite for training long reasoning models with a reproducible and cost-effective methodology. Because the data used in the DeepSeek-R1 series is proprietary, we develop an alternative approach that relies exclusively on public data and models. Our curriculum training progressively increases data difficulty and is combined with multi-stage post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO to long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2, respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant step toward making sophisticated reasoning models more accessible and deployable in real-world applications. Our models, training data, and code are available at https://github.com/Qihoo360/Light-R1.
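The curriculum idea in the abstract (progressively harder data across stages, with a small set of roughly 3,000 hard examples reserved for the later stage) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released pipeline: the `Problem` dataclass, the `pass_rate` difficulty proxy, the thresholds, and the `build_stages` helper are all hypothetical names introduced here for exposition.

```python
# Minimal sketch of difficulty-based curriculum staging (illustrative only).
# Assumption: difficulty is estimated by a reference model's pass rate over
# sampled attempts; lower pass rate means a harder problem.
from dataclasses import dataclass


@dataclass
class Problem:
    prompt: str
    solution: str
    pass_rate: float  # fraction of sampled attempts a reference model solved


def build_stages(problems: list[Problem]) -> dict[str, list[Problem]]:
    """Split a pool into a broad stage-1 set and a small, hard stage-2 set."""
    stage1 = [p for p in problems if p.pass_rate < 0.9]  # drop trivially easy items
    stage2 = [p for p in stage1 if p.pass_rate < 0.3]    # keep only the hardest
    return {"sft_stage1": stage1, "sft_stage2": stage2}


if __name__ == "__main__":
    pool = [
        Problem("easy question", "...", pass_rate=0.95),
        Problem("medium question", "...", pass_rate=0.60),
        Problem("hard question", "...", pass_rate=0.10),
    ]
    for name, items in build_stages(pool).items():
        print(name, len(items))
```

The property mirrored here is that the later stage is a strict, much smaller subset of the earlier one containing only the hardest items, matching the abstract's "3,000 challenging examples from our curriculum dataset" used to fine-tune the distilled models.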
- Liang Wen
- Yunke Cai
- Fenrui Xiao
- Xin He
- Qi An
- Zhenyu Duan
- Yimin Du
- Junchen Liu
- Lifu Tang
- Xiaowei Lv
- Haosheng Zou
- Yongchao Deng
- Shousheng Jia
- Xiangzheng Zhang