
s1: Simple test-time scaling (2501.19393v3)

Published 31 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

Summary

  • The paper introduces test-time scaling with budget forcing to extend LLM reasoning, resulting in significant performance gains on reasoning-intensive tasks.
  • Using a curated dataset of 1,000 high-quality examples, the method achieves remarkable sample efficiency compared to approaches using massive datasets.
  • Ablation studies show that sequential reasoning with controlled token budgets outperforms parallel methods, offering actionable insights for model tuning.

This paper is about improving the performance of LLMs during inference by using extra computation at test time. In other words, instead of relying only on what the model learned during training, the paper shows that you can “scale up” its reasoning quality by extending its processing time when it makes predictions. The authors call this approach “test-time scaling” and introduce a method known as budget forcing. The work is also notable because it achieves strong performance on reasoning-intensive tasks using only a small set of carefully selected training examples.


Background and Motivation

LLMs are usually trained on huge amounts of text data. However, the ability to perform deep reasoning during tasks (like solving complex math questions, answering scientific queries, or performing logical deductions) can sometimes be improved by allowing the model to “think” longer when it is making its prediction. Traditional methods often focus on increasing the size of the model or the amount of compute during training. In contrast, this work explores how gradually increasing the computation during testing (inference) can lead to better answers.


Dataset Curation and Sample Efficiency

The authors started by gathering a very large pool of questions from diverse sources. They then filtered these questions based on three main criteria:

  • Quality: Ensuring that the questions and corresponding reasoning chains were of high quality, with correct formatting and clear explanations.
  • Difficulty: Selecting questions that require more in-depth reasoning. The idea is that hard problems need longer “thinking” chains, and the length of the reasoning chain can be a signal of difficulty.
  • Diversity: Making sure that the dataset covers multiple domains or topics so the model learns to handle a wide range of problems.

After these steps, they narrowed down a very large dataset to a curated set of 1,000 examples, which they refer to as s1K. Using just this small, high-quality subset for tuning, the authors showed that the model can become much better at solving reasoning-intensive questions. This part of the work underlines the benefit of sample efficiency—achieving strong performance with a far smaller amount of curated data compared to other methods that rely on tens or hundreds of thousands of examples.
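The three-stage selection above can be sketched as a simple filtering pipeline. This is an illustrative reconstruction, not the paper's actual code: the field names, the trace-length proxy for difficulty, and the per-domain cap are assumptions standing in for the paper's concrete filters.

```python
# Hypothetical sketch of the quality -> difficulty -> diversity filtering
# described above. Field names and thresholds are illustrative only.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    trace: str            # the reasoning chain paired with the question
    domain: str           # topic label, e.g. "math", "physics"
    well_formatted: bool  # passed basic formatting checks

def curate(pool, per_domain_budget=2):
    # Quality: keep only well-formatted examples with a non-empty trace.
    quality = [ex for ex in pool if ex.well_formatted and ex.trace]
    # Difficulty: use trace length as a proxy (longer chains suggest harder
    # problems, as the paper notes) and keep the longer-trace half.
    quality.sort(key=lambda ex: len(ex.trace), reverse=True)
    hard = quality[: max(1, len(quality) // 2)]
    # Diversity: cap how many examples any single domain may contribute.
    selected, counts = [], {}
    for ex in hard:
        if counts.get(ex.domain, 0) < per_domain_budget:
            selected.append(ex)
            counts[ex.domain] = counts.get(ex.domain, 0) + 1
    return selected
```

In the paper, the analogous pipeline reduced a pool of roughly 59,000 candidates to the 1,000-example s1K set.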


Test-Time Scaling with Budget Forcing

A central contribution of the paper is the method called budget forcing. At test time, the LLM generates a “reasoning trace,” which is a series of intermediate tokens that represent the model’s thought process. The idea behind budget forcing is twofold:

  1. Enforcing an Upper Limit: If the model produces more “thinking” tokens than a desired budget, the system forcefully ends the reasoning portion by adding a special token. This helps the model decide that it has done enough thinking and should now output an answer.
  2. Encouraging Longer Reasoning: When the goal is to have the model think longer, the method suppresses the token that signals the end of the thought process. Additionally, it appends a command (for example, “Wait”) to the generated text. This simple intervention causes the model to produce more tokens, effectively giving it more time to double-check or improve its reasoning steps.

The approach is designed so that, by varying the amount of extra computation (or the length of the reasoning trace), the performance of the model improves gradually. The paper shows that with more test-time compute, the model can often correct mistakes in its reasoning and arrive at better answers.
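The two interventions above can be sketched as a decoding loop over a token stream. This is a minimal illustration of the mechanism, not the authors' implementation: `step` is a toy stub standing in for one decoding step of a real LLM, and the token names are placeholders.

```python
# Minimal sketch of budget forcing. END_OF_THINKING stands in for the
# model's end-of-thinking delimiter; `step` is a hypothetical callable
# that returns the next token given the trace so far.
END_OF_THINKING = "</think>"

def budget_force(step, min_tokens=0, max_tokens=100, max_waits=2):
    trace, waits = [], 0
    while True:
        tok = step(trace)
        if tok == END_OF_THINKING:
            # Too short: suppress the end token and append "Wait" so the
            # model keeps reasoning (at most max_waits times).
            if len(trace) < min_tokens and waits < max_waits:
                trace.append("Wait")
                waits += 1
                continue
            break
        trace.append(tok)
        # Budget exhausted: forcefully end the thinking phase.
        if len(trace) >= max_tokens:
            break
    return trace
```

Varying `min_tokens` and `max_tokens` is what produces the scaling curves discussed below: a larger budget means a longer trace and, up to a point, better answers.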


Evaluation and Comparison with Other Methods

The authors evaluated their method on several challenging benchmarks that include competition-level math problems and difficult scientific questions. They compared their tuned model, referred to as s1-32B, against models from other research groups as well as closed-source alternatives.

Key findings include:

  • Scaling Behavior: By gradually increasing the computation allowed at test time (using the budget forcing method), the model’s performance improves in a predictable, nearly linear fashion up to a point.
  • Sample Efficiency: Even though s1-32B was fine-tuned on only 1,000 high-quality examples, it achieved performance that approaches or exceeds that of some models trained on orders of magnitude more data.
  • Sequential versus Parallel Methods: The paper also discusses and compares different strategies for improving test-time reasoning. While some methods use parallel sampling (generating multiple independent answers and then choosing the best one), the sequential strategy—where the model builds on its previous reasoning output—seems to be more effective in this context.
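The sequential/parallel contrast can be made concrete with two small stubs. This is a schematic comparison under assumed interfaces (`sample_fn` and `refine_fn` are hypothetical stand-ins for model calls), not the paper's evaluation code.

```python
# Two test-time scaling strategies in miniature.
from collections import Counter

def parallel_vote(sample_fn, k=5):
    # Parallel: draw k independent answers, then majority-vote.
    # Each sample starts from scratch; no sample sees the others.
    answers = [sample_fn() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def sequential_refine(refine_fn, answer, rounds=3):
    # Sequential: repeatedly revise one answer, each pass building on
    # the previous output -- the regime budget forcing operates in.
    for _ in range(rounds):
        answer = refine_fn(answer)
    return answer
```

The paper's finding is that the sequential regime scales better here: later tokens can inspect and correct earlier reasoning, whereas parallel samples cannot learn from one another.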

Ablation Studies and Insights

The paper includes detailed ablation studies to understand which factors are most important for the model’s success:

  • Data Selection Criteria: Experiments showed that relying on any one of the criteria (quality, difficulty, or diversity) alone does not work as well as combining them. The careful selection of samples is crucial for the observed improvements.
  • Budget Forcing Techniques: Different strings (like “Wait” or other similar prompts) were experimented with to see how they affect performance. The studies reveal that the specific way in which the model is encouraged to continue thinking can have a significant impact on the final performance.

In addition to budget forcing, the authors explored other methods, such as rejection sampling (where the model repeatedly generates outputs until an answer fits a certain length) and different types of conditional control (token, step, and class-conditional prompts). The results indicated that the simple budget forcing approach not only provides perfect control over the amount of test-time computation but also leads to a strong and consistent increase in performance.
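For contrast with budget forcing, the rejection-sampling baseline just mentioned can be sketched as follows. The `generate` callable is a hypothetical stand-in for a full model generation; the length window is illustrative.

```python
# Sketch of length-based rejection sampling: keep regenerating until a
# trace falls inside the target length window. Unlike budget forcing,
# this gives no direct control over any single generation's length and
# may waste many samples before one fits the budget.
def rejection_sample(generate, lo, hi, max_tries=50):
    for _ in range(max_tries):
        trace = generate()
        if lo <= len(trace) <= hi:
            return trace
    return None  # no sample met the budget
```

This inefficiency is part of why the paper favors budget forcing, which steers a single generation to the budget rather than filtering after the fact.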


Discussion and Broader Impact

This work challenges the traditional view that only increasing training size or model parameters can lead to improved performance in reasoning tasks. Instead, the paper illustrates that by intelligently managing the computation at test time—with minimal additional training data—it is possible to substantially boost performance. The methodological contributions offer a new direction for building more efficient and capable reasoning models. This has implications for applications where real-time performance and computational efficiency are important.

Moreover, the research demonstrates that complex reasoning capabilities may already be latent in large, pre-trained models. The role of fine-tuning with carefully selected data is to unlock or “activate” these capabilities. The approach also provides a transparent, open framework that could enable other researchers to build on these ideas, potentially leading to safer and more interpretable LLMs.


Conclusion

In summary, the paper presents a simple yet effective method—budget forcing—for improving the performance of LLMs by extending their reasoning process at test time. With just 1,000 carefully curated examples and a strategic method to control the computation during inference, the approach achieves strong results on difficult reasoning tasks. The work highlights the importance of sample efficiency and provides valuable insights into how test-time compute can be scaled to further enhance the capabilities of LLMs.
