GSM8K: Grade School Math Benchmark

Updated 28 October 2025
  • GSM8K is a curated benchmark of 8,500 manually reviewed grade school math word problems requiring multi-step reasoning and natural language explanations.
  • It introduces a verification-augmented approach where a generator samples multiple candidate solutions and a verifier assigns token-level correctness scores to enhance accuracy.
  • Advanced data augmentation and meta-reasoning strategies applied to GSM8K extend its use in analyzing LLM generalization, cultural biases, and multimodal limitations.

GSM8K is a curated benchmark of 8,500 high-quality, linguistically diverse grade school math word problems designed to rigorously evaluate the multi-step mathematical reasoning capabilities of LLMs. Each problem is written in natural language, typically requires two to eight steps to solve using elementary arithmetic operations (+, −, ×, ÷), and is structured to probe a model’s ability to reason through distinctly non-templated scenarios and explain its reasoning in full sentences. GSM8K is divided into 7,500 training samples and 1,000 held-out test samples, with comprehensive quality control yielding an estimated error rate below 2% (Cobbe et al., 2021).
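For concreteness, the dataset can be inspected directly; a minimal sketch using the Hugging Face `datasets` library (assuming the `openai/gsm8k` hub repository with its `main` configuration and `question`/`answer` fields):

```python
# Minimal sketch: load GSM8K and inspect one example.
# Assumes the Hugging Face `datasets` library and the "openai/gsm8k"
# hub repository ("main" configuration, "question"/"answer" fields).
from datasets import load_dataset

gsm8k = load_dataset("openai/gsm8k", "main")

print(len(gsm8k["train"]), len(gsm8k["test"]))  # train and test split sizes

example = gsm8k["train"][0]
print(example["question"])  # natural-language word problem
print(example["answer"])    # stepwise solution ending in "#### <final answer>"
```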

1. Dataset Structure, Design Principles, and Content

GSM8K’s unique characteristics result from deliberate manual construction, prioritizing diversity, clarity, and moderate difficulty. Each problem emulates scenarios solvable by a bright middle school student, emphasizing reasoning chains rather than rote calculation. Unlike scraped datasets, every sample underwent rigorous human review for errors, redundancy, and diversity. Problems are intentionally engineered to avoid pattern templating, resulting in high linguistic variance—as substantiated by their use for benchmarking LLM generalization with respect to unseen syntactic and semantic forms. Furthermore, solutions are recorded in natural language for each problem, enabling both evaluation of the answer and inspection of intermediate, interpretable reasoning steps.

Split | Number of Examples | Description
Train | 7,500 | Diverse, manually written
Test | 1,000 | Held-out, non-templated

Each solution annotates a complete stepwise chain, providing instrumentation for analyzing where LLMs fail, e.g., semantic misunderstanding versus calculation errors.
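Because each reference solution ends in a `#### <answer>` line and embeds calculator annotations of the form `<<2+2=4>>`, such failure analysis typically begins by splitting a solution into its reasoning chain and its final answer; a minimal parsing sketch:

```python
import re

def final_answer(solution: str) -> str:
    """Extract the final answer after the '####' delimiter used by
    GSM8K reference solutions, dropping thousands separators."""
    return solution.split("####")[-1].strip().replace(",", "")

def reasoning_steps(solution: str) -> list[str]:
    """Return the stepwise chain with calculator annotations of the
    form <<2+2=4>> stripped out."""
    body = re.sub(r"<<[^>]*>>", "", solution.split("####")[0])
    return [line.strip() for line in body.splitlines() if line.strip()]
```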

2. Verification-Augmented Learning: Generator and Verifier Model Paradigm

The original GSM8K paper inaugurated a verification-augmented pipeline, addressing the documented limitation that autoregressive models fail to self-correct on multi-step tasks (Cobbe et al., 2021). The procedure involves:

  • Fine-tuning a generator LLM (initialized from GPT-3) for a small number of epochs (two, in the original setup).
  • Sampling a large number of candidate completions per problem (e.g., 100 candidates), labeling each as correct/incorrect based solely on whether it produces the correct numerical answer.
  • Training a verifier as a separate transformer to estimate the correctness probability, using the concatenated problem and candidate solution as input.

The verifier is optimized with both a verification loss (often MSE or logit-based) and the standard language modeling loss. Crucially, the best-performing verifier produces “token-level” correctness scores, yielding a value function $V(s_1, s_2, \dots, s_t) \approx P(\text{correct} \mid \text{problem}, \text{token sequence up to } t)$, offering finer granularity and stronger regularization than solution-level verification. At inference, many candidate completions are again sampled and scored by the verifier, and the highest-ranked output is selected. In certain configurations, a voting scheme among top-ranked samples further enhances reliability.
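A schematic of this sample-then-rank loop is sketched below; `generate` and `verifier_score` are placeholder callables standing in for the finetuned generator and trained verifier, not an official API:

```python
from typing import Callable, List

def solve_with_verifier(
    problem: str,
    generate: Callable[[str, int], List[str]],    # samples n candidate solutions
    verifier_score: Callable[[str, str], float],  # estimates P(correct | problem, solution)
    n_candidates: int = 100,
) -> str:
    """Sample-then-rank inference in the style of Cobbe et al. (2021):
    draw many completions, score each with the verifier (for a token-level
    verifier, the value at the final token), and keep the best."""
    candidates = generate(problem, n_candidates)
    return max(candidates, key=lambda c: verifier_score(problem, c))
```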

This strategy is empirically shown to provide a substantial performance boost, with a 6B-parameter model plus verification matching or slightly exceeding a finetuned 175B baseline—an uplift comparable to a 30× increase in effective capacity (Cobbe et al., 2021).

3. Scaling Laws and Data Augmentation Strategies

Recent studies have extended GSM8K’s utility beyond the original framework through diverse augmentation methods. For example, MuggleMath (Li et al., 2023) evaluates the impact of query evolution and response diversification, demonstrating log-linear scaling of in-domain accuracy with the volume of augmented queries. Query evolution applies procedural modifications (changing numbers, introducing fractions/percentages, adding conditional statements, and increasing conceptual complexity), while response augmentation expands the number of distinct stepwise solutions per query. The empirical performance improvement follows approximately $y = 10.7 \cdot \log(x) + 13.2$ for LLaMA-7B, where $x$ is the number of augmented samples and $y$ is in-domain accuracy.
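A worked illustration of that fit (the log base is not restated here, so the natural log below is an assumption of this sketch and the absolute numbers are indicative only):

```python
import math

def fitted_accuracy(n_augmented: int) -> float:
    """Reported log-linear fit for LLaMA-7B under MuggleMath-style
    augmentation: y = 10.7 * log(x) + 13.2. Natural log is assumed
    for this sketch; the constants are those quoted in the text."""
    return 10.7 * math.log(n_augmented) + 13.2

# Under this fit, doubling the augmented-query budget adds a constant
# ~10.7 * ln(2) ≈ 7.4 accuracy points, regardless of the starting size.
```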

Similarly, PersonaMathQA (Luo et al., 2 Oct 2024) employs persona-driven rewriting (changing context and style according to occupation or background) and “reflection” on mistakes, producing a compact yet highly diverse dataset, which yields state-of-the-art results on GSM8K despite being smaller than other leading datasets such as MetaMathQA.

Other augmentation pipelines, such as MetaMathQA (Yu et al., 2023), combine answer augmentation, rephrasing, and backward reasoning via self-verification and FOBAR masking, targeting both stepwise reasoning and model generalization. The effect is both to increase linguistic coverage and to broaden the difficulty distribution, with diversity gains demonstrated via embedding distance metrics.
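A simplified sketch of a backward-reasoning (FOBAR-style) rewrite, masking one quantity and conditioning on the known answer; the template wording is illustrative, not the authors’ exact pipeline:

```python
import random
import re

def backward_variant(question: str, answer: str, seed: int = 0) -> str:
    """Mask one number in the question and ask the model to recover it
    given the original final answer (a FOBAR-style rewrite sketch)."""
    rng = random.Random(seed)
    numbers = re.findall(r"\d+", question)
    if not numbers:
        return question  # nothing to mask; leave the question unchanged
    # Naive first-occurrence string replace; adequate for a sketch.
    masked = question.replace(rng.choice(numbers), "X", 1)
    return (f"{masked} If the answer to the above question is {answer}, "
            f"what is the value of the unknown variable X?")
```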

4. Verifier-Driven Model Architectures and Performance

A critical finding in GSM8K-centric research is the outsized benefit of dedicating parameter budget to token-level verifiers versus simply scaling generation models (Liu et al., 2023). In TinyGSM, a synthetic dataset of 12.3M grade-school problems generated by GPT-3.5, a pair of 1.3B-parameter models—one generator, one verifier—achieves 81.5% accuracy on GSM8K, surpassing much larger models and even outperforming the GPT-3.5 teacher. Training the verifier with diversity in candidate generations (across temperature and checkpoint settings) proved essential for robust selection.

Additionally, recent work has found that ensembling and majority voting over sampled solutions (self-consistency-style selection) further amplify the benefit of verifier-based approaches. The general strategy has been demonstrated to scale more efficiently with additional data and regularization, such as residual dropout.
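The voting step itself is simple; a sketch of majority voting over the final answers of k sampled solutions (a self-consistency-style recipe, not any single paper’s exact procedure):

```python
from collections import Counter
from typing import Iterable

def majority_vote(final_answers: Iterable[str]) -> str:
    """Return the most frequent final answer among k sampled solutions."""
    return Counter(a.strip() for a in final_answers).most_common(1)[0][0]

# e.g., majority_vote(["72", "72", "68", "72"]) -> "72"
```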

5. Impact on Model Generalization, Reasoning Failures, and Evaluation Paradigms

GSM8K has served as the principal analytical tool for studying LLM reasoning failures. GSM8K-Zero (Chiang et al., 21 Jan 2024) constructs extraction-only analogs (problems where the answer is stated in the text, requiring no computation), revealing that both open and closed-source LLMs demonstrate excessive “over-reasoning,” producing chain-of-thought responses even when unnecessary. Such redundancy was shown to correlate negatively with accuracy, and the effect persisted even when models were prompted to omit stepwise reasoning if the problem was trivial.

Meta-reasoning frameworks such as MR-GSM8K (Zeng et al., 2023) transform the evaluation paradigm by requiring not only solution production but also “reasoning about reasoning”: models critique their answers, identify the first logical error, and explain their mistakes. Results indicate a pronounced disparity between raw answer accuracy and meta-cognitive diagnostic capabilities, with some state-of-the-art models exhibiting twenty-point differences in “MR-Score”, highlighting the distinction between shallow and deep reasoning competencies.

Scheherazade (Miner et al., 30 Sep 2024) further extends GSM8K by algorithmically chaining multiple problems to produce benchmarks testing logical dependency and long-range integration. While single GSM8K problems now approach saturation (>94%), performance sharply declines on chained variants, especially when backward reasoning dependencies are introduced.
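A toy sketch of forward chaining in this spirit, where the first problem’s answer replaces a stated quantity in the second; the templating below is a simplification of the paper’s algorithm:

```python
def chain_forward(q1: str, q2: str, quantity_in_q2: str) -> str:
    """Rewrite q2 so that one of its given quantities must first be
    obtained by solving q1. `quantity_in_q2` is the literal number in
    q2 to replace (illustrative template, not the paper's procedure)."""
    rewritten = q2.replace(
        quantity_in_q2, "the answer to the first question", 1
    )
    return f"First question: {q1} Second question: {rewritten}"
```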

6. Special Topics: Cultural Adaptation, Multimodality, and Robustness

GSM8K has also been adapted to explore performance sensitivity to cultural shifts and multimodal reasoning. The “Mathematics Isn't Culture-Free” study (Tomar et al., 1 Jul 2025) demonstrates that LLMs show significant accuracy drops on culturally re-templated GSM8K problems (e.g., changing names, currencies, scenarios for Africa, India, China, Korea, Japan), and that explicit chain-of-thought prompting can partially mitigate these gaps. This indicates potential model overfitting to training-set cultural cues and motivates development of culturally diverse benchmarks.
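An illustrative re-templating pass of the kind the study describes, swapping surface cues while preserving the arithmetic (the substitution table below is invented for this sketch):

```python
# Invented substitution table for illustration only.
SUBSTITUTIONS = {
    "India": {"James": "Arjun", "dollars": "rupees", "$": "₹"},
    "Korea": {"James": "Min-jun", "dollars": "won", "$": "₩"},
}

def retemplate(question: str, region: str) -> str:
    """Swap names and currencies for a target region, leaving the
    numbers (and thus the underlying arithmetic) intact."""
    for src, dst in SUBSTITUTIONS[region].items():
        question = question.replace(src, dst)
    return question
```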

GSM8K-V (Yuan et al., 29 Sep 2025) introduces purely visual versions of each GSM8K problem, rendered by automated image-generation pipelines and validated with human annotation. Model evaluations show that text-based GSM8K is nearly saturated, but the best VLMs achieve only 46.93% accuracy on GSM8K-V, revealing modality-specific limitations—particularly in symbol grounding and sequential aggregation of visual facts.

7. Ongoing Implications and Future Directions

GSM8K remains the foundational benchmark for probing mathematical reasoning in LLMs. Methodologies initiated by its design (verification-based answer selection, chain-of-thought evaluation, query and response augmentation, and meta-reasoning) have been generalized to more challenging domains, e.g., MATH, code generation, and visual benchmarks. Its continued adaptation to probe domain generalization, cultural robustness, and multimodal integration is driving research into model architectures, data augmentation strategies, and new evaluation metrics. GSM8K’s low error rate, diversity, and stepwise solution format offer essential instrumentation for the design of next-generation models capable of robust and interpretable reasoning across natural language, code, and visual modalities.
