GSM8K Benchmark: Evaluating Math Reasoning
- GSM8K is a curated dataset of 8,500 grade-school math word problems designed to assess multi-step arithmetic reasoning in language models.
- Its original release introduced a generation-plus-verification approach that improves accuracy, yielding gains comparable to a roughly 30× increase in model size.
- Variants such as GSM-PLUS and MR-GSM8K extend the benchmark to test robustness to adversarial perturbations and to expose detailed reasoning processes, while GSM1k probes for data contamination.
The GSM8K benchmark is a widely adopted dataset and evaluation protocol for measuring progress in the mathematical reasoning abilities of LLMs. Designed to diagnose the capabilities and limitations of contemporary models on multi-step arithmetic word problems at grade-school level, GSM8K has catalyzed research in reasoning verification, prompt engineering, continual learning evaluation, and scalable training methodologies. As both a standard yardstick and a foundation for numerous methodological innovations, it has also inspired variants and critiques concerning data contamination, robustness, and dataset design.
1. Dataset Design and Core Properties
GSM8K consists of 8,500 grade-school mathematics word problems, partitioned into 7,500 training examples and 1,000 test examples. Each problem is linguistically diverse, hand-crafted to avoid superficial templates or excessive regularity, and requires two to eight steps of elementary arithmetic—addition, subtraction, multiplication, or division—to solve. Solutions are written in natural language, typically in a chain-of-thought (CoT) style that exposes the full reasoning sequence, not just the answer. The dataset’s questions remain within the conceptual grasp of a bright middle school student, while still presenting a significant challenge to modern transformer-based LLMs (Cobbe et al., 2021).
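For concreteness, a record can be inspected with the Hugging Face `datasets` library, which hosts the dataset under the `gsm8k` identifier (`main` configuration, with `question` and `answer` fields); the `#### ` delimiter marking the final numeric answer is part of the dataset's annotation format. A minimal sketch:

```python
# Minimal sketch: load GSM8K and inspect one record.
from datasets import load_dataset

ds = load_dataset("gsm8k", "main")   # "train" and "test" splits

example = ds["test"][0]
print(example["question"])           # natural-language word problem
print(example["answer"])             # CoT solution ending in "#### <number>"

# By convention, the final numeric answer follows the "#### " delimiter.
final_answer = example["answer"].split("#### ")[-1].strip()
print(final_answer)
```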
Distinctive features of GSM8K include:
- High linguistic diversity: The problems feature substantial variation in phrasing and structure.
- Error minimization: Manual curation ensures an extremely low rate of ambiguous or flawed problems.
- Moderate difficulty and multi-step structure: Problems are neither trivial nor excessively complex, with arithmetic steps explicitly required.
- Transparent solution annotations: Natural language reasoning chains offer an interpretable window into model errors and reasoning processes.
This design enables granular error analysis and supports the development and comparison of advanced reasoning strategies.
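A common way to exploit this annotation format for automatic grading is final-answer exact match. The helper below follows the dataset's `#### ` convention for gold solutions; the fallback to the last number in a model's output is a widely used heuristic, not an official protocol.

```python
import re

# Final answer after the "#### " delimiter (gold solutions).
ANSWER_RE = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")
# Any number, used as a fallback for free-form model outputs.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_answer(text: str) -> str | None:
    """Extract the final numeric answer from a GSM8K-style solution."""
    match = ANSWER_RE.search(text)
    if match:
        return match.group(1).replace(",", "")
    numbers = NUMBER_RE.findall(text)
    return numbers[-1].replace(",", "") if numbers else None

def exact_match(prediction: str, gold_solution: str) -> bool:
    """Grade a model output against a gold solution by final answer."""
    pred = extract_final_answer(prediction)
    return pred is not None and pred == extract_final_answer(gold_solution)
```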
2. Challenges for Transformer-Based LLMs
Despite the conceptual simplicity of GSM8K’s problem distribution, transformer models—even at very large scales—have historically struggled to achieve high accuracy on the benchmark (Cobbe et al., 2021). Two central difficulties are identified:
- Sensitivity to Intermediate Errors: An arithmetic error in an early step often renders subsequent steps irretrievable, a phenomenon accentuated by the lack of feedback or recovery in autoregressive token generation.
- Scaling Inefficiency: Both increased model size and data scale fail to yield proportional improvements, suggesting that naively upscaling standard supervised learning is suboptimal for this domain.
Furthermore, classic fine-tuning approaches tend to overfit, leading to overconfident or poorly calibrated predictions, and are especially brittle when multiple candidate completions are sampled.
3. Verification-Based Solution Paradigm
In response to these challenges, GSM8K became a testbed for “generation plus verification” approaches that decouple solution synthesis from evaluation. The canonical method involves two components:
- Generator: A model trained on the dataset generates multiple candidate solutions for a given question.
- Verifier: An independently trained model evaluates each candidate, outputting a score or probability for correctness conditioned on the question and the candidate solution.
At test time, a sizable set of candidate completions is sampled from the generator (often with high-temperature sampling to promote diversity), and the verifier selects the candidate judged most likely to be correct. The final decision rule is

$$\hat{s} = \arg\max_{s \in S} \, p_\theta(\mathrm{correct} \mid q, s),$$

where $S$ is the set of candidate solutions, $q$ the problem, and $p_\theta(\mathrm{correct} \mid q, s)$ the verifier's predicted probability of correctness (Cobbe et al., 2021).
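In code, this decision rule amounts to best-of-N selection. The sketch below assumes hypothetical `generator.sample` and `verifier.score_correctness` interfaces; neither name comes from a released API.

```python
def solve_with_verifier(question: str, generator, verifier,
                        num_candidates: int = 100,
                        temperature: float = 0.7) -> str:
    # Sample diverse candidate solutions from the generator.
    candidates = [
        generator.sample(question, temperature=temperature)
        for _ in range(num_candidates)
    ]
    # The verifier scores each candidate's probability of being correct,
    # conditioned on the question; return the argmax per the decision rule.
    return max(candidates,
               key=lambda s: verifier.score_correctness(question, s))
```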
Experiments demonstrate that verification can yield accuracy gains equivalent to a ~30× model size increase compared to classic finetuning, is robust to increased training data, and improves calibration. Two distinct verifier architectures have been studied: solution-level verifiers, which produce a single correctness score for a complete candidate solution, and token-level verifiers, which predict eventual correctness after every token; the latter ultimately achieves better generalization and is less prone to overfitting.
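As a conceptual illustration of the token-level design, the following PyTorch sketch attaches a scalar value head to a language model backbone and broadcasts the per-solution correctness label to every token position. The attribute names (e.g., `last_hidden_state`) assume a Hugging Face-style backbone and are illustrative, not the exact architecture of Cobbe et al. (2021).

```python
import torch
import torch.nn as nn

class TokenLevelVerifier(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                    # pretrained LM backbone
        self.value_head = nn.Linear(hidden_size, 1) # scalar score per token

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids).last_hidden_state  # (B, T, H)
        return self.value_head(hidden).squeeze(-1)           # (B, T) logits

def verifier_loss(logits: torch.Tensor, is_correct: torch.Tensor) -> torch.Tensor:
    # Broadcast the per-solution correctness label to every token position,
    # so the verifier learns a running estimate of eventual correctness.
    targets = is_correct[:, None].expand_as(logits).float()
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)
```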
4. Performance Benchmarks and Scaling Laws
Methodologies based on candidate generation and verification—rather than simple argmax sampling—have set state-of-the-art results on GSM8K for both large and small models. For example, by pairing high-quality synthetic datasets with verification, models as small as 1.3B parameters have achieved over 80% GSM8K accuracy, matching or exceeding closed-source baselines that are an order of magnitude larger (Liu et al., 2023). Empirical findings further reveal that boosting verifier capacity yields greater gains than scaling the generator alone. This demonstrates that parameter-efficient validation of reasoning chains is more effective than reliance on mere model size increases.
A selection of results is summarized below:
| Configuration | Model size | Data | GSM8K accuracy |
| --- | --- | --- | --- |
| Finetuned transformer (vanilla) | 6B | Natural, GSM8K | <70% |
| Generation + verification (two-stage) | 6B | Natural, GSM8K | Exceeds finetuned 175B vanilla |
| TinyGSM generator + verifier | 1.3B + 1.3B | Synthetic, TinyGSM | 81.5% |
| DUP-augmented prompting (GPT-4) | - | Natural, GSM8K | 97.1% (zero-shot) |
| DiversiGATE SelfLearner, phase II | - | Natural, GSM8K | 61.8% |
5. Robustness, Overfitting, and Data Contamination
Recent analysis interrogates whether high GSM8K accuracy always signals true reasoning ability rather than dataset familiarity. Construction of GSM1k—a comparable but distinct dataset written by human annotators—shows that many models exhibit accuracy drops of up to 13% on GSM1k relative to GSM8K, with strong statistical correlation between a model’s GSM8K–GSM1k gap and its log-likelihood of reproducing GSM8K text, evidencing dataset contamination (Zhang et al., 1 May 2024). This effect is especially pronounced in models believed to have been exposed to GSM8K during training, but frontier models (e.g., GPT-4, Claude) display minimal signs of memorization and generalize well.
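A memorization probe of this kind can be approximated with standard tooling: compute a model's mean per-token log-likelihood on benchmark text and compare it against fresh, distributionally similar problems. The sketch below uses the Hugging Face `transformers` API with `gpt2` as a stand-in model; the comparison logic is illustrative, not the exact statistic of Zhang et al. (2024).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def mean_token_logprob(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    # With labels == input_ids, the returned loss is the mean per-token
    # negative log-likelihood of the text under the model.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()

# Unusually high likelihood on benchmark problems, relative to fresh but
# distributionally similar problems, is evidence of contamination.
```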
Additionally, adversarial perturbations (GSM-PLUS), compositional benchmarks, and chained-problem settings (Scheherazade, compositional GSM) highlight that state-of-the-art models often fail dramatically when superficial question variants, distractors, chaining, or multi-hop dependencies are introduced (Li et al., 29 Feb 2024, Miner et al., 30 Sep 2024, Hosseini et al., 2 Oct 2024). Performance on these “robustness” splits is often 10–20% lower than on standard GSM8K.
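The flavor of such perturbations can be conveyed by a simple numerical substitution, in the spirit of GSM-PLUS's numerical-change category. The sketch below randomly swaps the integers in a question; unlike the curated benchmark, it does not recompute the gold answer, which a real evaluation would need to do.

```python
import random
import re

def perturb_numbers(question: str, rng: random.Random) -> str:
    """Replace each integer in a question with a nearby, different value."""
    def swap(match: re.Match) -> str:
        value = int(match.group())
        candidates = [v for v in range(max(1, value - 5), value + 6)
                      if v != value]
        return str(rng.choice(candidates))
    return re.sub(r"\d+", swap, question)

rng = random.Random(0)
print(perturb_numbers(
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?",
    rng,
))
```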
6. Extensions, Variants, and Derivative Benchmarks
GSM8K has inspired a diverse ecosystem of derivative datasets and benchmarks:
- GSM-PLUS extends GSM8K with adversarial perturbations—e.g., numerical changes, distractor insertions, reversals of operations—to evaluate question-format robustness (Li et al., 29 Feb 2024).
- MR-GSM8K transforms problem-solving into a meta-reasoning task that requires error localization and explanation, thereby surfacing reasoning process errors otherwise invisible to pass@1 metrics (Zeng et al., 2023).
- GSM8K-aug supports multi-shot prompt compression benchmarking by allowing for variable numbers of chain-of-thought demonstrations in the prompt (Ali et al., 30 Mar 2024).
- Compositional GSM/Scheherazade chain multiple problems in sequence or via conditional dependencies, quantifying explicit reasoning gaps in compositional or multi-hop settings (Hosseini et al., 2 Oct 2024, Miner et al., 30 Sep 2024).
- MathClean introduces synthetic error identification and categorization using GSM8K as a seed corpus, enabling detailed evaluation of data-cleaning methods for mathematical reasoning problems (Liang et al., 26 Feb 2025).
These variants collectively challenge models to move beyond memorization or superficial pattern-matching toward genuine, robust mathematical reasoning.
7. Impact and Future Directions
GSM8K’s precise design motivates a methodological shift from traditional supervised learning to two-stage verification, preference optimization on reasoning traces, and compositional prompting. Its adoption as a de facto standard for multi-step elementary math supports analysis of scaling laws, error propagation, and reasoning calibration.
However, as model accuracy on canonical GSM8K approaches human-level or saturates, emerging benchmarks such as Omni-MATH—an Olympiad-level, multi-domain and multi-difficulty collection—are required to meaningfully differentiate the best models (Gao et al., 10 Oct 2024). Research now frequently focuses on:
- Meta-reasoning evaluation to probe process understanding, not just answer generation.
- Preference optimization and reward modeling over reasoning traces, enhancing stepwise arithmetic rigor (Lahlou et al., 23 Jun 2024).
- Transfer and generalization testing via new, contamination-controlled benchmarks like GSM1k and synthetic data cleaning tools (MathClean).
- Hierarchical and bilingual benchmarks (e.g., MathBench) to assess deeper theory and cross-lingual capabilities (Liu et al., 20 May 2024).
A plausible implication is that benchmarks must evolve continuously in both format and coverage, and that the GSM8K lineage will likely persist as the basis for deriving and evaluating such next-generation mathematical reasoning datasets. The field is also moving toward evaluation frameworks that prioritize process transparency, reasoning trace robustness, and the ability to self-diagnose errors, rather than maximizing raw answer accuracy alone.