s1K Dataset Benchmark

Updated 8 August 2025
  • The s1K dataset is a highly curated benchmark of 1,000 reasoning examples, each pairing a challenging question with an explicit chain-of-thought trace and an answer.
  • The accompanying work introduces budget forcing to precisely control test-time compute allocation, improving reasoning accuracy and enabling systematic scaling experiments.
  • The dataset supports sample-efficient fine-tuning, evidenced by the s1-32B model's strong performance in competition math and multi-step inference tasks.

The s1K dataset is a curated benchmark of 1,000 reasoning-intensive examples, each comprising a question, a detailed reasoning trace, and an answer. Designed for evaluating and training transformer-based LLMs on challenging reasoning tasks, s1K supports methodological advances in test-time scaling and controlled inference. The dataset was developed in the context of research on test-time compute scaling in LLMs and is accompanied by open-source code, models, and formal evaluation metrics (Muennighoff et al., 31 Jan 2025).

1. Dataset Structure and Curation

The s1K dataset consists of carefully selected triplets: (1) a challenging question, (2) an explicit reasoning trace (chain of thought), and (3) the final answer. These examples are distilled from an initial pool of approximately 59,000 questions, resulting in a compact set for sample-efficient model training and evaluation. Each trace documents the logical steps or calculations leading to the answer, facilitating supervised learning and transparent model assessment.
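
A minimal sketch of how one such triplet can be loaded and inspected, assuming the dataset is published on the Hugging Face Hub; the dataset identifier and field names below are assumptions for illustration rather than details confirmed by the text:

```python
from datasets import load_dataset

# Dataset id and field names are assumptions; consult the simplescaling/s1
# repository for the authoritative schema.
s1k = load_dataset("simplescaling/s1K", split="train")
print(len(s1k))  # expected: 1,000 examples

record = s1k[0]
for field in ("question", "thinking_trajectories", "solution"):
    if field in record:
        print(f"--- {field} ---")
        print(str(record[field])[:300])  # truncate long reasoning traces for display
```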

Curation relies on three explicit criteria:

  • Quality: Examples exhibiting formatting errors, API failures, or other issues are excluded, ensuring reliability and clarity of reasoning.
  • Difficulty: Selection excludes problems solvable by simpler models and emphasizes samples with longer reasoning chains, reflecting substantial reasoning complexity.
  • Diversity: Subject domains are classified and sampled uniformly, leveraging tools like the Mathematics Subject Classification and extensions to other reasoning fields, to avoid dataset bias toward any single area.

Ablation experiments confirm that all three criteria are essential: variants relying on only one or two of them performed notably worse on reasoning benchmarks. Neither selecting only the longest examples nor random subsampling recaptures the properties of s1K; the criteria-based selection achieves superior sample-efficiency.
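
The selection procedure can be summarized as a quality filter, a difficulty filter, and domain-balanced sampling. The sketch below illustrates this logic only; the pool schema, the helper predicates, and the sampling details are simplifying assumptions, not the authors' released code:

```python
import random
from collections import defaultdict

def curate(pool, is_clean, solved_by_weak_model, domain_of, target_size=1000, seed=0):
    """Illustrative three-stage curation: quality -> difficulty -> diversity.

    pool: iterable of dicts with 'question', 'trace', 'answer' (assumed schema).
    is_clean, solved_by_weak_model, domain_of: caller-supplied callables standing
    in for the paper's quality checks, weak-model solve checks, and subject
    classification (e.g., Mathematics Subject Classification codes).
    """
    rng = random.Random(seed)

    # Quality: drop examples with formatting errors, API failures, etc.
    clean = [ex for ex in pool if is_clean(ex)]

    # Difficulty: drop problems a simpler model already solves; prefer long traces.
    hard = [ex for ex in clean if not solved_by_weak_model(ex)]
    hard.sort(key=lambda ex: len(ex["trace"]), reverse=True)

    # Diversity: group by domain, then sample domains uniformly, taking the
    # longest remaining trace within the chosen domain each time.
    by_domain = defaultdict(list)
    for ex in hard:
        by_domain[domain_of(ex)].append(ex)

    selected = []
    while len(selected) < target_size and by_domain:
        domain = rng.choice(list(by_domain))
        selected.append(by_domain[domain].pop(0))
        if not by_domain[domain]:
            del by_domain[domain]
    return selected
```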

2. Data Domains and Content

s1K draws from a broad spectrum of reasoning-heavy questions, notably including competition mathematics (e.g., AIME24, MATH), challenging science queries, and tasks requiring multi-step logical inference. The uniform sampling process guarantees representation across problem types, with no domain dominating the composition.

A typical example from s1K features:

  • A mathematically nontrivial problem,
  • A multi-step chain-of-thought solution annotated in natural language,
  • The answer, provided with clear formatting.

This structure aligns with the supervised fine-tuning paradigm in LLM development, enabling mapping from (question, reasoning trace) to answer with explicit step traceability.
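
Concretely, each triplet can be rendered into a single training string in which the reasoning trace is delimited from the final answer. A minimal sketch, using generic delimiters rather than the exact chat template of the released s1 code:

```python
def format_example(question: str, trace: str, answer: str) -> str:
    """Render a (question, reasoning trace, answer) triplet into one SFT string.

    The delimiter tokens below are illustrative; the released s1 codebase
    defines the actual chat template and end-of-thinking marker.
    """
    return (
        f"<|user|>\n{question}\n"
        f"<|assistant|>\n<think>\n{trace}\n</think>\n"
        f"Final Answer: {answer}"
    )
```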

3. Methodological Innovations: Budget Forcing

Budget forcing is introduced as a test-time control paradigm for scaling inference compute in transformer models. It operates in two modes:

  • Compute Termination: Generation is ended after a predetermined budget of thinking tokens, typically by appending a special token or an explicit prompt (e.g., "Final Answer:").
  • Compute Extension: When the model attempts termination, appending a nudge such as "Wait" causes continued generation. This strategy not only increases compute but also prompts "double-checking," with the model revisiting or refining its previous reasoning.

Budget forcing allows precise control over the number of "thinking tokens" allocated per question, supporting systematic ablation and performance scaling experiments.
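
A simplified generation loop illustrating both modes is sketched below. It assumes a Hugging Face causal LM interface, greedy decoding, and an illustrative end-of-thinking prompt; the released s1 code defines the actual delimiters and may implement the budget logic differently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-32B-Instruct"   # illustrative choice of base model
END_OF_THINKING = "Final Answer:"     # illustrative end-of-thinking prompt

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def budget_forced_generate(question: str, max_thinking_tokens: int, num_waits: int = 0) -> str:
    """Cap thinking tokens (compute termination) and optionally append 'Wait'
    to extend reasoning (compute extension) before forcing the final answer."""
    text = f"{question}\nThink step by step.\n"

    # Compute termination: stop the thinking phase once the token budget is spent.
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=max_thinking_tokens, do_sample=False)
    text = tok.decode(out[0], skip_special_tokens=True)

    # Compute extension: nudge the model to keep reasoning and double-check itself.
    for _ in range(num_waits):
        ids = tok(text + "\nWait", return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_thinking_tokens, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)

    # Force the answer by appending the end-of-thinking prompt.
    ids = tok(text + f"\n{END_OF_THINKING}", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```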

4. Evaluation Metrics and Sample Efficiency

The paper formalizes evaluation with two primary metrics:

  • Scaling: Defined by the average slope of the accuracy curve as thinking tokens increase,

$$\text{Scaling} = \frac{1}{\binom{|\mathcal{A}|}{2}} \sum_{\substack{a,\, b \in \mathcal{A} \\ b > a}} \frac{f(b) - f(a)}{b - a}$$

where $f$ maps a thinking-token budget to accuracy and $\mathcal{A}$ is the set of evaluated budgets; a direct computation of this quantity is sketched after this list.

  • Control: Quantifies the method's precision in constraining compute, with budget forcing shown to offer perfect control (100%).
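
Because the scaling metric is simply the mean of pairwise finite-difference slopes, it can be computed directly from a table of (budget, accuracy) measurements. The sketch below implements the formula above; the accuracy values in the usage line are invented for illustration:

```python
from itertools import combinations

def scaling_metric(accuracy_by_budget: dict[int, float]) -> float:
    """Average pairwise slope of accuracy vs. thinking-token budget.

    accuracy_by_budget maps each budget a in A to the measured accuracy f(a).
    """
    budgets = sorted(accuracy_by_budget)
    pairs = list(combinations(budgets, 2))  # all (a, b) with b > a
    return sum(
        (accuracy_by_budget[b] - accuracy_by_budget[a]) / (b - a) for a, b in pairs
    ) / len(pairs)

# Invented measurements at three thinking-token budgets:
print(scaling_metric({512: 0.30, 1024: 0.38, 2048: 0.45}))
```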

Empirical results demonstrate that increased test-time compute via budget forcing improves reasoning accuracy, and that performance scales reliably with additional thinking tokens. Notably, models fine-tuned on s1K (e.g., s1-32B) achieve competitive results from a small training set, outperforming closed models such as o1-preview on key benchmarks (up to 27% improvement on competition math tasks).

5. Model Training and Resource Requirements

Supervised fine-tuning (SFT) on s1K is performed using Qwen2.5-32B-Instruct as the base LLM. The SFT process is highly efficient, requiring only 26 minutes on 16 NVIDIA H100 GPUs. Despite the modest dataset size, the resulting s1-32B model attains strong sample-efficiency, reaching or exceeding the performance of models trained with much larger datasets.
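
A minimal fine-tuning sketch along these lines is given below, using the plain Hugging Face Trainer; the placeholder data, hyperparameters, and output path are illustrative assumptions rather than the paper's exact configuration:

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-32B-Instruct"  # base model named in the text
tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Placeholder: in practice this list holds the 1,000 formatted
# (question, reasoning trace, answer) strings described in Section 2.
formatted_texts = ["Example question\n<think>\nExample trace\n</think>\nFinal Answer: 42"]

ds = Dataset.from_dict({"text": formatted_texts}).map(
    lambda ex: tok(ex["text"], truncation=True, max_length=8192),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="s1-32b-sft",
        num_train_epochs=5,             # illustrative hyperparameters,
        per_device_train_batch_size=1,  # not the paper's exact settings
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```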

This efficiency highlights the benefit of high-quality, high-difficulty, and diverse examples for accelerating model capabilities in reasoning.

6. Openness and Reproducibility

The s1K dataset, all model weights (s1-32B), and the full codebase for both the budget forcing methodology and evaluation framework are released under open-source licenses. The repository is available at https://github.com/simplescaling/s1, supporting reproducibility and further experimentation by the research community.

7. Context and Implications in LLM Research

s1K establishes a benchmark for sample-efficient reasoning and test-time scaling in autoregressive LLMs. The explicit triplet format facilitates chain-of-thought learning and transparent evaluation. Budget forcing enables practical control over inference compute, supporting empirical studies of compute-accuracy relationships.

Formal ablation and efficiency analyses suggest that carefully curated, high-quality small datasets can yield strong reasoning benchmarks, informing future dataset construction and model training paradigms. The demonstrated results imply that emphasis on reasoning diversity and difficulty is crucial for maximizing the learning yield in high-performance transformer models.

References

  • Muennighoff, N., et al. (2025). s1: Simple test-time scaling. arXiv:2501.19393.