
RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? (2501.11284v1)

Published 20 Jan 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs and different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2% to 81.6%, and on the USA Math Olympiad (AIME), it solves 46.7% of problems using only 21k mixed-code-math datasets. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities-even with limited dataset and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.

Summary

  • The paper demonstrates that scaling Long-CoT instruction data to 1000k samples significantly improves reasoning performance in mathematical and multimodal tasks.
  • Key factors for effective Long-CoT training include the scale of data, the base model size, and the difficulty of the training samples.
  • The RedStar model, trained on scaled Long-CoT data and optimized with DPO/PPO, achieves strong results on benchmarks like MATH-Hard and extends Long-CoT to multimodal reasoning.

The paper "RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?" explores the impact of scaling Long Chain-of-Thought (Long-CoT) data for training slow-thinking models. The authors introduce RedStar, a model developed through scaling Long-CoT data to 1000k samples, and conduct experiments to identify key factors for specialization and scale in Long-CoT training. The paper investigates the effects of varying data sizes, model scales, and reinforcement learning strategies on reasoning performance across mathematical and multimodal tasks.

Key Questions and Contributions

The paper addresses these key questions:

  • Does scaling up Long-CoT instruction data enhance a model's slow, deliberate reasoning ability?
  • What methods for Long-CoT dataset construction have the best sample efficiency?
  • How do base model scale and specialization impact Long-CoT data scaling?
  • Can Long-CoT and its scaled-up versions enhance performance in multimodal tasks?

The primary contributions include:

  • Demonstration of performance gains using Long-CoT, even with smaller models and limited data, highlighting the sample efficiency of Long-CoT.
  • Identification of the critical role of sample difficulty in the learning process.
  • Evidence that Long-CoT reasoning can be triggered with a few thousand examples, while larger models achieve improvements with extensive data.
  • Exploration of reinforcement learning (RL) for advancing slow-thinking systems.
  • Introduction of RedStar, which achieves strong results on benchmarks like MATH-Hard and USA Math Olympiad (AIME).
  • Extension of Long-CoT to multimodal tasks, achieving competitive results on GeoQA and MathVista-GEO.

Long-CoT Data Curation

The Long-CoT data curation process for mathematics data involves three steps:

  1. Prompt Collection: Gathering prompts from diverse sources, including math, code-based, and multimodal datasets such as MetaMathQA, NuminaMath-CoT, TACO, and GeoQA.
  2. Difficulty-Level Scoring and Augmentation: Annotating and filtering prompts based on difficulty levels using a difficulty-level model trained on MATH and Omni-MATH datasets. Problems with difficulty levels 7 and above are selected for augmentation using the QwQ model.
  3. Response Verification: Evaluating response quality by sampling each question multiple times using QwQ, applying manual rules to remove repeated or incorrect responses, and retaining positive samples for supervised fine-tuning (SFT) and negative samples for subsequent RL training.

The final dataset consists of 1000k samples, including 220k prompts spanning difficulty levels 3-9 and a subset of 1.3k prompts (yielding 4k correct prompt-response pairs) at difficulty levels 7-9. A minimal sketch of the filtering-and-verification loop is given below.
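To make the curation loop concrete, the following is a minimal Python sketch of steps 2 and 3, assuming hypothetical helpers score_difficulty (standing in for the difficulty-level model), generate_long_cot (a QwQ-style sampler), and verify_answer (the rule-based checker); these names and stub implementations are illustrative, not code released with the paper.

```python
# Illustrative sketch of the difficulty-filtering and response-verification loop.
# score_difficulty, generate_long_cot, and verify_answer are hypothetical stubs
# standing in for the difficulty-level model, a QwQ-style generator, and the
# rule-based checker; they are not APIs released with the paper.

from dataclasses import dataclass, field

def score_difficulty(prompt: str) -> int:
    """Stub for the difficulty-level model (trained on MATH/Omni-MATH)."""
    return 7  # placeholder: every prompt rated level 7

def generate_long_cot(prompt: str) -> str:
    """Stub for a QwQ-style Long-CoT sampler."""
    return "...long chain of thought...\nFinal answer: 42"

def verify_answer(response: str, gold: str) -> bool:
    """Stub rule-based check on the final answer."""
    return response.strip().endswith(gold)

@dataclass
class CuratedSample:
    prompt: str
    positives: list = field(default_factory=list)  # correct responses -> SFT
    negatives: list = field(default_factory=list)  # incorrect responses -> RL/DPO

def curate(prompts, gold_answers, min_difficulty=7, n_samples=8):
    """Keep hard prompts, sample Long-CoT responses, and verify each one."""
    curated = []
    for prompt, gold in zip(prompts, gold_answers):
        if score_difficulty(prompt) < min_difficulty:   # step 2: difficulty filter
            continue
        sample = CuratedSample(prompt)
        for _ in range(n_samples):                      # step 3: sample and verify
            response = generate_long_cot(prompt)
            (sample.positives if verify_answer(response, gold)
             else sample.negatives).append(response)
        if sample.positives:                            # retain verified pairs
            curated.append(sample)
    return curated

# Example: a single toy prompt whose gold answer is "42".
data = curate(["What is 6 x 7?"], ["42"])
print(len(data), len(data[0].positives), len(data[0].negatives))
```

Under this scheme, the verified (positive) responses form the SFT set, while the incorrect (negative) responses are kept for the later preference-based RL stage, as described above.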

Scaling Data and Models for Long-CoT Efficiency

The paper investigates the impact of Long-CoT dataset sizes and model scales on reasoning performance, focusing on mathematical tasks. Evaluation datasets include MATH-Hard, Olympiad-Bench, College_Math, High_School_League-24, and AIME24. Key findings include:

  • Increasing Long-CoT data scale leads to performance improvements. For example, the 7B-Math-Instruct model improves from 36.8 to 41.0 with 4k Long-CoT examples, increasing to 45.4 at 1000k examples.
  • Larger models and math-specialized models demonstrate better performance in Long-CoT training.
  • Challenging sample problems play a pivotal role in the synthesis process of Long-CoT.

Reinforcement Learning

The paper explores Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and REINFORCE++ to enhance reasoning performance. A rule-based reward model is used to filter positive/negative responses. Results indicate that DPO and PPO are effective, achieving an average performance of 47.2, while REINFORCE++ shows variability across tasks. The reward function $r(x, \hat{y}, y^*)$ is defined as:

$$
r(x, \hat{y}, y^*) =
\begin{cases}
1 & \text{if } \mathsf{verifier}(\hat{y}, y^*) = \text{True} \\
0 & \text{if } \mathsf{verifier}(\hat{y}, y^*) = \text{False} \\
-1 & \text{if } \hat{y} \text{ does not contain a valid answer format}
\end{cases}
$$

where:

  • $x$ is the input question
  • $\hat{y}$ is the generated response
  • $y^*$ is the ground-truth answer
  • $\mathsf{verifier}$ is a function that checks whether the generated response matches the ground truth
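As a concrete (though simplified) reading of this rule, the sketch below implements the reward in Python. The helpers extract_answer and verifier are hypothetical stand-ins; the paper does not specify this exact interface, and a real verifier would normalize mathematical expressions rather than compare raw strings.

```python
# Minimal sketch of the rule-based reward r(x, y_hat, y_star) defined above.
# extract_answer and verifier are illustrative stubs, not the paper's code.

def extract_answer(response: str):
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    marker = "Final answer:"
    if marker not in response:
        return None                      # no valid answer format
    return response.rsplit(marker, 1)[-1].strip()

def verifier(pred: str, gold: str) -> bool:
    """Stub equivalence check; a real checker would normalize math expressions."""
    return pred == gold.strip()

def reward(x: str, y_hat: str, y_star: str) -> float:
    """+1 if the verified answer is correct, 0 if incorrect, -1 if unparseable."""
    pred = extract_answer(y_hat)
    if pred is None:
        return -1.0
    return 1.0 if verifier(pred, y_star) else 0.0

print(reward("Q", "...Final answer: 42", "42"))    # 1.0
print(reward("Q", "...Final answer: 41", "42"))    # 0.0
print(reward("Q", "no final answer given", "42"))  # -1.0
```

Responses scored +1 versus 0 or -1 under such a rule supply the positive/negative pairs that feed the DPO and PPO stages described above.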

Experiments and Results

Experiments validate the scaling methods and training strategies using the Qwen-32B-Instruct model. Context parallelism via DeepSpeed-Ulysses is employed to avoid memory overflow on long sequences (a toy illustration of this sequence-sharding idea follows the results list below). The training data includes a 4k math-derived Long-CoT dataset and a 16k code-derived Long-CoT dataset. The results indicate:

  • RedStar models outperform Qwen models across most tasks, demonstrating the effectiveness of Long-CoT tuning.
  • Long-CoT tuning significantly boosts performance across multiple domains.
  • RedStar-DPO, which combines Long-CoT tuning with DPO-based reinforcement learning, achieves the highest average score of 58.3.
  • On Chinese Graduate Entrance Mathematics Test datasets, RedStar achieves results comparable to those of closed-source APIs, including DeepSeek-R1 and Kimi-Math, and of open-source models, demonstrating language-transfer capability.
  • RedStar achieves the highest average score (68.8) across STEM, reasoning, commonsense, factuality, long-text generation, and SedarEval tasks.
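On the training-setup side, the snippet below is a toy, single-process illustration of the sequence-sharding idea behind Ulysses-style context parallelism; it does not use the DeepSpeed API, and all shapes and names are illustrative assumptions. It shows how P per-rank sequence shards holding all attention heads can be re-partitioned, which is what the all-to-all exchange achieves across GPUs, into full-sequence views over a subset of heads, so attention over a very long Long-CoT sequence fits on each device.

```python
# Toy single-process illustration of Ulysses-style context parallelism
# (conceptual only; not the DeepSpeed implementation or API).
import numpy as np

P, L, H, D = 4, 16, 8, 32        # "ranks", sequence length, heads, head dim
assert L % P == 0 and H % P == 0

# x[p] is rank p's shard: contiguous positions p*L//P..(p+1)*L//P-1, all H heads.
x = np.random.randn(P, L // P, H, D)

# Simulated all-to-all: rank p sends head-group q of its sequence shard to rank q,
# so each rank ends up with the full sequence for H//P of the heads.
y = (x.reshape(P, L // P, P, H // P, D)  # split the head axis into P groups
       .transpose(2, 0, 1, 3, 4)         # swap the rank and head-group axes
       .reshape(P, L, H // P, D))        # y[q]: full sequence, head group q

# Sanity check: rank 1's view of global position L//P + 2, local head 3
# equals rank 1's original shard at local position 2, head 1*(H//P) + 3.
assert np.allclose(y[1, L // P + 2, 3], x[1, 2, H // P + 3])
print(y.shape)  # (4, 16, 2, 32)
```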

Multimodal Experiments

The paper extends Long-CoT to multimodal LLMs (MLLMs) for geometric reasoning tasks, using the GeoQA training set. InternVL2-8B is used as the base model for Long-CoT instruction fine-tuning. Key findings include:

  • Fine-tuning with a small set of Long-CoT instructions synthesized from GeoQA-train surpasses the performance of models like GPT-4o/V.
  • Multimodal Long-CoT achieves performance similar to Geo170K in instruction fine-tuning, despite using less data.
  • MLLMs trained with multimodal Long-CoT can engage in fine-grained reflection on multimodal elements and reassess the authenticity of generated conditions, demonstrating superior generalization capability on out-of-distribution datasets like MathVista-GEO and Geometry3K.

Conclusion

The paper concludes that Long-CoT tuning improves reasoning performance with limited data, does not hinder general task performance, and benefits from RL-based training. Applying Long-CoT methods to vision-LLMs also results in improvements across multimodal tasks. Future work will focus on synthesizing high-quality Long-CoT datasets using instruction-tuned models and extending Long-CoT to additional complex benchmarks.
