s1: Simple test-time scaling

Published 31 Jan 2025 in cs.CL, cs.AI, and cs.LG | (2501.19393v3)

Abstract: Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct LLM on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper's main contribution is the introduction of budget forcing, a simple test-time intervention that controls compute allocation to enhance reasoning performance.
It details the creation of the s1K dataset, curated based on quality, difficulty, and diversity, to maximize sample efficiency with minimal data.
Experimental results on the s1-32B model demonstrate that sequential scaling with budget forcing outperforms traditional parallel methods in achieving competitive performance.

s1: Simple Test-Time Scaling

The paper "s1: Simple test-time scaling" (2501.19393) introduces a straightforward methodology for test-time scaling in LLMs, achieving strong reasoning performance with minimal training data and a novel technique called budget forcing. The key contributions revolve around curating a high-quality dataset and implementing a simple yet effective method to control test-time compute.

s1K Dataset Curation

The authors address the challenge of sample efficiency by creating s1K, a dataset of 1,000 question-reasoning trace pairs. This dataset is meticulously curated from an initial pool of 59K samples, emphasizing three key criteria:

Quality: Ensuring high-quality reasoning traces by filtering out formatting issues and API errors.
Difficulty: Selecting samples that challenge LMs, assessed by the performance of Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct.
Diversity: Covering a wide range of domains using the Mathematics Subject Classification (MSC) system.

Figure 1: Illustration of the s1K dataset and the sample-efficiency of the s1-32B model.

The selection process involves a three-stage filtering approach, prioritizing quality, difficulty, and diversity. This contrasts with other approaches that rely on large-scale reinforcement learning or SFT with tens of thousands of examples. The focus on data curation aligns with findings from instruction tuning, where the quality of data is paramount.

Test-Time Scaling with Budget Forcing

The paper introduces budget forcing, a simple test-time intervention to control the amount of compute a model expends on a given problem. This technique involves:

Forced Termination: Appending an end-of-thinking token delimiter to curtail the model's reasoning process.
Lengthening: Suppressing the end-of-thinking token delimiter and appending "Wait" to encourage further exploration.
Figure 2: An example of budget forcing, where the model is prompted to self-correct by appending "Wait".

This method allows for direct manipulation of the model's thinking duration at test time, enabling the exploration of performance scaling with compute. Budget forcing is classified as a sequential scaling method, where later computations depend on earlier ones.

Experimental Results and Analysis

The effectiveness of s1K and budget forcing is demonstrated through the development of s1-32B, a 32B parameter model finetuned on s1K. The model's performance is benchmarked against various state-of-the-art LMs, including OpenAI's o1-preview, DeepSeek r1, and Qwen's QwQ-32B-preview, using AIME24, MATH500, and GPQA Diamond datasets. The results show that s1-32B achieves competitive performance, particularly on AIME24, surpassing o1-preview by a significant margin with budget forcing. Ablation studies further validate the importance of difficulty, diversity, and quality in the s1K dataset.

Figure 3: Performance of s1-32B with budget forcing on various reasoning tasks as test-time compute is varied.

The paper defines key metrics for evaluating test-time scaling methods: Control, Scaling, and Performance. Budget forcing excels in providing perfect control and positive scaling, leading to strong overall performance. The results indicate that sequential scaling, exemplified by budget forcing, is more effective than parallel scaling methods like majority voting for Qwen2.5-32B-Instruct.

The budget forcing technique shows promising scaling trends and some extrapolation capabilities, as seen in (Figure 4a). Moreover, (Figure 5) illustrates how sequential and parallel scaling methods affect performance, highlighting the benefits of budget forcing over other techniques like REBASE and majority voting.

Ablation Experiments

Comprehensive ablation experiments are conducted to assess the impact of data curation and test-time scaling methods.

Data Ablations: Datasets created using only one or two of the criteria (quality, difficulty, diversity) perform worse than s1K, which combines all three.
Scaling Method Ablations: Budget forcing outperforms conditional length-control methods and rejection sampling in terms of scaling and control.
Figure 6: Performance of s1-32B on AIME24 using rejection sampling with different token limits.

Rejection sampling, surprisingly, exhibits an inverse scaling trend, potentially due to a correlation between shorter generations and the model being on the right track from the start.

Implications and Future Directions

The research demonstrates the feasibility of achieving strong reasoning performance and test-time scaling with limited data and a simple, interpretable method. The success of s1-32B suggests that pretrained models already possess substantial reasoning capabilities that can be activated through targeted finetuning and test-time interventions. The paper highlights the limitations of budget forcing, including context window constraints and the eventual flattening of performance gains. Future research directions include exploring improvements to budget forcing, such as dynamic string selection and combinations with frequency penalties, as well as investigating the application of budget forcing to models trained with reinforcement learning.

Conclusion

The paper "s1: Simple test-time scaling" (2501.19393) makes a compelling case for simple, data-efficient approaches to reasoning in LMs. By introducing s1K and budget forcing, the authors provide a recipe for achieving strong performance and test-time scaling with limited resources. The work's emphasis on transparency and open-source availability fosters further research and development in the field.

Markdown Report Issue