Test-Time Scaling
Test-time scaling is a paradigm in LLM development that enhances model performance by allocating extra computational resources during inference rather than during training. This strategy enables models to achieve substantial gains on complex reasoning tasks by extending the duration or breadth of their "thinking" at test time, for example by producing longer chain-of-thought traces or by revisiting and refining prior reasoning steps, while keeping model weights and training data frozen. The approach remains effective even when paired with highly sample-efficient supervised finetuning, setting new benchmarks on mathematical and reasoning-intensive problems.
1. Core Principles and Methodology
Test-time scaling departs from the traditional view that LLM capabilities scale primarily with training compute (more data, larger models). Instead, it holds model parameters and training data constant, and focuses on inference-time compute allocation. The essential mechanism involves increasing either the length of the generated reasoning sequence for an individual problem or the number of different solution paths explored before returning a final answer.
A central technique, termed budget forcing, provides precise control over inference compute. During decoding, the system can:
- Impose a maximum reasoning token count: The generation is forcibly ended with a delimiter (such as "\n---\nFinal Answer:") if the predefined budget is reached.
- Impose a minimum depth by suppressing termination: If a model attempts to conclude early, the system intercepts and appends a string like "Wait" to the generated reasoning trace, prompting the model to continue reflecting. This often results in the model rechecking or revising prior steps.
This approach is realized at the decoding algorithm level, without changing model architecture or requiring retraining, and is computationally lightweight to implement:
```python
# Budget forcing at decode time: T_min / T_max bound the number of
# thinking tokens, and END_OF_THINKING_DELIMITER is the token with
# which the model tries to end its reasoning phase.
trace, t = [], 0
while t < T_max:
    token = model.generate_next()
    trace.append(token)
    t += 1
    if token == END_OF_THINKING_DELIMITER:
        if t >= T_min:
            break  # minimum budget met: let thinking end normally
        # Early termination attempt: override the delimiter and
        # append "Wait" to push the model into further reflection.
        trace.append('Wait')
        t += 1
```
The outcome is a controllable, monotonic scaling of model performance—accuracy increases as more test-time compute is allocated, up to saturation.
2. Model Training and Data Curation
The approach is practically demonstrated using the s1K dataset and the Qwen2.5-32B-Instruct model:
- s1K dataset formation: Out of ~59,000 candidate questions across mathematics, science, law, and logic, 1,000 were selected through rigorous filtering for quality (removal of malformed or trivial cases), difficulty (exclusion of problems solvable by SOTA models; preference for longer reasoning traces), and diversity (coverage of 50+ subject domains, with categorization using Mathematics Subject Classification tools).
- Finetuning: A brief supervised finetuning run using standard next-token prediction on the s1K dataset is sufficient to endow the base model with substantial latent reasoning capability.
- Compute efficiency: Training required only 26 minutes on 16 NVIDIA H100 GPUs using PyTorch FSDP.
This process intentionally emphasizes quality over quantity, showing that high-level reasoning can emerge from minimal, thoughtfully curated supervision.
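To make the selection pipeline concrete, here is a minimal sketch of the three-stage quality/difficulty/diversity filter described above. All field names (`well_formed`, `solved_by_reference_models`, `reasoning_trace`, `domain`) and the uniform domain sampling are illustrative assumptions, not the paper's exact implementation:

```python
import random
from collections import defaultdict

def select_s1k(candidates, k=1000, seed=0):
    """Illustrative three-stage filter: quality -> difficulty -> diversity.
    Field names are hypothetical placeholders for the paper's criteria."""
    rng = random.Random(seed)
    # 1. Quality: drop malformed or trivial questions.
    pool = [q for q in candidates if q["well_formed"] and not q["trivial"]]
    # 2. Difficulty: keep questions that reference models fail to solve,
    #    preferring those with longer reasoning traces.
    pool = [q for q in pool if not q["solved_by_reference_models"]]
    pool.sort(key=lambda q: len(q["reasoning_trace"]), reverse=True)
    # 3. Diversity: spread the final k picks across subject domains.
    by_domain = defaultdict(list)
    for q in pool:
        by_domain[q["domain"]].append(q)  # each list stays length-sorted
    selected, domains = [], list(by_domain)
    while len(selected) < k and domains:
        d = rng.choice(domains)               # sample a domain uniformly
        selected.append(by_domain[d].pop(0))  # take its hardest remaining item
        if not by_domain[d]:
            domains.remove(d)
    return selected
```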
3. Quantitative Performance and Metrics
The paper introduces a set of formal metrics for evaluating test-time scaling:
- Control: The precision with which inference-time compute can be set, $\text{Control} = \frac{1}{|\mathcal{A}|}\left|\{a \in \mathcal{A} \mid a_{\min} \le a \le a_{\max}\}\right|$ ($100\%$ for perfect budget control).
- Scaling: The monotonicity and slope of accuracy as a function of the test-time budget, measured as the average pairwise slope $\text{Scaling} = \binom{|\mathcal{A}|}{2}^{-1} \sum_{a, a' \in \mathcal{A},\, a' > a} \frac{f(a') - f(a)}{a' - a}$.
- Performance: The maximal benchmark accuracy achieved under any test-time configuration, $\text{Performance} = \max_{a \in \mathcal{A}} f(a)$,
where $\mathcal{A}$ denotes the set of test-time budgets evaluated and $f(a)$ is the accuracy at compute budget $a$.
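Under these definitions, the metrics are straightforward to compute from (budget, accuracy) pairs collected during evaluation. The sketch below assumes such pairs are already available; the function names and the sample values in the usage lines are illustrative, not measurements from the paper:

```python
from itertools import combinations

def control(realized_budgets, a_min, a_max):
    """Fraction of runs whose realized compute lies within [a_min, a_max]."""
    inside = sum(1 for a in realized_budgets if a_min <= a <= a_max)
    return inside / len(realized_budgets)

def scaling(points):
    """Average pairwise slope of accuracy f(a) over distinct budgets a.
    points: iterable of (budget, accuracy) pairs."""
    pts = sorted(points)
    pairs = list(combinations(pts, 2))
    return sum((f2 - f1) / (a2 - a1) for (a1, f1), (a2, f2) in pairs) / len(pairs)

def performance(points):
    """Best accuracy achieved at any evaluated budget."""
    return max(f for _, f in points)

# Illustrative values only:
points = [(1024, 0.30), (2048, 0.43), (4096, 0.50), (8192, 0.57)]
print(scaling(points), performance(points))
```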
Empirical results
- On AIME24, s1-32B accuracy increases from 50% (single-trace) to 57% when utilizing test-time scaling via budget forcing.
- On MATH and AIME24 benchmarks, s1-32B outperforms the o1-preview model by as much as 27%.
- Sample efficiency: s1-32B is trained on only 1,000 reasoning traces, yet remains competitive with models refined on >800,000 RL/SFT samples.
| Model | # Reasoning Examples | AIME24 | MATH500 | GPQA (Diamond) |
|---|---|---|---|---|
| o1-preview (OpenAI) | N.A. | 44.6 | 85.5 | 73.3 |
| s1-32B | 1K | 56.7 | 93.0 | 59.6 |
| Qwen2.5-32B-Instruct | - | 26.7 | 84.0 | 49.0 |
| r1 (DeepSeek) | ≫800K | 79.8 | 97.3 | 71.5 |
Crucially, sequential scaling, where the model is deliberately prompted to continue reasoning ("Wait") and reflect, delivers clear advantages over parallel strategies such as best-of-N voting.
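For contrast, a parallel baseline such as best-of-N majority voting spends its extra compute on independent samples rather than on a longer single trace. A minimal sketch, assuming a hypothetical `sample_answer` interface on the model:

```python
from collections import Counter

def best_of_n(model, prompt, n=8):
    """Parallel test-time scaling baseline: draw n independent answers
    and return the most common one (majority vote)."""
    answers = [model.sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Because each sample starts from scratch, this strategy cannot revisit its own partial reasoning, which is exactly the capability that sequential budget forcing exploits.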
4. Implementation Aspects and Self-Correction
Budget forcing is particularly practical because it requires only changes to inference code:
- There are no model architecture modifications.
- No additional annotations, supervised reward models, or reinforcement learning are required.
- The budget is enforced by adjusting prompt formatting and decode-time rules, making it highly flexible; one such rule is sketched below.
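As one concrete realization of such a decode-time rule, the end-of-thinking delimiter can be suppressed by masking its logit until a minimum number of thinking tokens has been generated. The sketch below uses the Hugging Face transformers `LogitsProcessor` interface; the class name and parameter values are assumptions for illustration, not the paper's released code:

```python
import torch
from transformers import LogitsProcessor

class SuppressEndOfThinking(LogitsProcessor):
    """Hypothetical processor: forbid the end-of-thinking token until
    at least `min_think_tokens` have been generated past the prompt."""

    def __init__(self, end_think_token_id: int, prompt_len: int,
                 min_think_tokens: int):
        self.end_think_token_id = end_think_token_id
        self.prompt_len = prompt_len
        self.min_think_tokens = min_think_tokens

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        n_generated = input_ids.shape[1] - self.prompt_len
        if n_generated < self.min_think_tokens:
            # Mask the delimiter's logit so the model cannot stop early.
            scores[:, self.end_think_token_id] = float("-inf")
        return scores
```

Passed to `model.generate` via `logits_processor=LogitsProcessorList([...])`, this enforces the minimum budget; the maximum budget maps directly onto `max_new_tokens`.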
A notable benefit is the capability for self-correction: forcing the model to continue after its initial conclusion often leads it to revisit and fix prior errors, resulting in meaningful accuracy improvements (e.g., +7% on AIME24).
The technique also enables practitioners to extrapolate beyond training-time performance by adjusting the "thinking budget," directly exposing and leveraging the model's latent capacity for deep reasoning.
5. Open-Source Ecosystem and Community Impact
The approach is accompanied by the release of:
- s1-32B model weights (finetuned Qwen2.5-32B-Instruct),
- s1K dataset (difficult, diverse, and decontaminated reasoning examples),
- Code and documentation for training, inference, and evaluation.
The repository is available at https://github.com/simplescaling/s1.
This transparency addresses the knowledge gap left by proprietary models like OpenAI's o1 series, providing a reproducible framework and benchmark dataset for the community. By demonstrating o1-level reasoning from only 1,000 high-quality examples, it also lowers the barrier for researchers without access to massive compute or closed data sources.
6. Significance and Research Implications
Test-time scaling represents a conceptual shift: achieving improved model accuracy not through increased training scale but by smarter, more flexible use of compute during inference. The paradigm establishes that a model's capability to solve difficult reasoning tasks can be unlocked with minimal data and a simple, robust post-hoc decode-time method.
By making all resources open-source, the work sets a new standard for reproducibility and fosters accelerated progress. The separation of data curation, finetuning, and test-time adaptation enables clearer ablation and benchmarking of each component. The methodology, metrics, and open tools serve as a reference for developing new algorithms, evaluation protocols, and contest-level reasoning systems built around principled test-time scaling.
Summary Table: Test-Time Scaling Fundamentals in the s1 Framework
| Aspect | Details |
|---|---|
| Compute Allocation | Controlled only at inference (test time) |
| Key Mechanism | Budget forcing via decode-time token control and "Wait" |
| Training Data | 1,000 diversified reasoning traces (s1K) |
| Implementation | No retraining or model changes required for scaling |
| Relative Performance | Exceeds o1-preview on math/competition benchmarks |
| Self-correction | Achieved by iterative budget-forced reflection |
| Openness | Model, data, and code released for community use |
Test-time scaling thus emerges as a tractable, reproducible, and highly sample-efficient approach for maximizing LLM reasoning capacity at deployment.