GSM1k Benchmark: Novel Arithmetic Reasoning Tests

Updated 10 June 2026

GSM1k is a human-curated dataset that assesses LLMs’ elementary arithmetic reasoning using novel, grade-school level arithmetic problems.
The dataset mirrors GSM8k by enforcing controlled difficulty levels and matching answer distributions through precise statistical techniques.
Evaluation metrics, including per-character log-likelihood and accuracy gaps, help distinguish genuine reasoning ability from overfitting and memorization.

The Grade School Math 1000 (GSM1k) benchmark is a rigorously constructed dataset intended to evaluate the elementary arithmetic reasoning capabilities of LLMs in a setting that is expressly free from dataset contamination. Developed as a direct analog to the established GSM8k benchmark, GSM1k serves both as a probe for true model generalization and as a control to detect overfitting and memorization that may have arisen from benchmark leakage. GSM1k’s construction, metrics, and evaluation protocols adhere closely to those of GSM8k, while introducing robust guarantees of novelty and statistical matching, enabling fine-grained analysis of reasoning ability versus dataset memorization (Zhang et al., 2024).

1. Dataset Construction and Statistical Properties

GSM1k comprises 1,250 grade-school–level arithmetic problems, curated exclusively by human annotators without the use of LLMs at any stage. Annotators were supplied with three representative GSM8k problems per task and instructed to synthesize entirely novel questions adhering to the stylistic and structural conventions of GSM8k. Each question can be solved using only the four basic arithmetic operators (addition, subtraction, multiplication, division), and all final answers are constrained to positive integers.

A central methodological pillar is the enforced “difficulty level” $L$, defined as the number of elementary arithmetic steps (operationalized as “calculator” tags $\langle\langle\cdot\rangle\rangle$) in the step-by-step solution. The distribution of this step count is explicitly controlled to mirror that observed in the union of GSM8k’s train and test splits. Post-filtering ensures that the answer-magnitude cumulative distribution function (CDF) of GSM1k almost exactly overlaps that of GSM8k, with the Kolmogorov–Smirnov statistic used to optimize the match. The resulting dataset thus parallels GSM8k not only in style and complexity but also in human-level solvability and quantitative answer properties.
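The step-count definition above can be illustrated with a minimal sketch. It assumes solutions are stored as GSM8k-style strings in which each arithmetic step carries a `<<expression=result>>` annotation; the function names `step_count` and `step_histogram` are hypothetical and not taken from the GSM1k release.

```python
import re
from collections import Counter

# Assumption: each elementary step is annotated as <<expression=result>>.
CALC_TAG = re.compile(r"<<[^<>]*>>")


def step_count(solution: str) -> int:
    """Return the number of calculator tags in one step-by-step solution."""
    return len(CALC_TAG.findall(solution))


def step_histogram(solutions):
    """Return the normalized distribution of step counts over many solutions."""
    counts = Counter(step_count(s) for s in solutions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in sorted(counts)}


example = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether.\n"
    "#### 72"
)
print(step_count(example))  # two calculator tags, so L = 2
```

Comparing `step_histogram` outputs for a candidate pool and for GSM8k is one simple way to check whether the two step-count distributions line up.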

The table below summarizes core statistical controls in GSM1k:

Attribute	GSM1k Specification	GSM8k Matching Method
Problem count	1,250	None (fixed size for GSM1k)
Operator set	$+, -, \times, \div$	Identical
Step-count (‘difficulty’ $L$ )	Controlled per problem	Matched to GSM8k histogram
Answer CDF	Post-filtered for overlap	Minimized Kolmogorov–Smirnov D
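
As a rough illustration of the answer-CDF control, the two-sample Kolmogorov–Smirnov statistic $D$ is the largest vertical distance between two empirical CDFs. The sketch below computes only that statistic in plain Python; the filtering procedure that uses it is not specified here, and the name `ks_statistic` is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: the maximum gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    if not a_sorted or not b_sorted:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions that only change at data points,
    # so it is enough to compare them at every observed value.
    for v in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy answer magnitudes (illustrative values, not benchmark data).
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value of $D$ near zero indicates that the two answer distributions nearly coincide, which is the property the post-filtering step targets.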

Human evaluation confirms that GSM1k matches or slightly exceeds GSM8k in solvability: within 15 minutes, annotators solved a mean of $4.36 \pm 1.11$ GSM1k problems versus $4.07 \pm 0.93$ GSM8k problems. Moreover, distinguishability between GSM1k and GSM8k is near chance (21.83% identification rate vs. 20% chance), which indicates that the two datasets are stylistically very similar.

2. Metrics and Overfitting Analysis

GSM1k’s primary performance metric is model accuracy, defined as the fraction of problems for which the extracted final integer matches the gold answer. Secondary metrics track the solution step-length $L$ and the distribution of answer magnitudes with strict CDF alignment.

To disentangle genuine reasoning from dataset contamination, models’ average per-character log-likelihood $\mathcal{L}$ on the GSM8k test set is computed:

$\mathcal{L} = \frac{1}{c}\sum_{i=1}^T \log p(x_i \mid x_{<i})$

where $c$ is the character count of the sequence $x = (x_1, \dots, x_T)$. The “performance gap” $\Delta$ is defined as the difference in accuracy between the GSM8k and GSM1k test questions:

$\Delta = \mathrm{Acc}_{\mathrm{GSM8k}} - \mathrm{Acc}_{\mathrm{GSM1k}}$
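
The per-character log-likelihood and the performance gap defined above can be sketched as follows. This assumes the per-token log-probabilities of a sequence have already been obtained from a model; both function names are hypothetical, and the toy log-probabilities are illustrative only.

```python
def per_char_log_likelihood(token_logprobs, text):
    """Sum of token log-probabilities divided by the character count c of text."""
    c = len(text)
    if c == 0:
        raise ValueError("text must be non-empty")
    return sum(token_logprobs) / c


def accuracy_gap(acc_gsm8k, acc_gsm1k):
    """Performance gap: GSM8k accuracy minus GSM1k accuracy."""
    return acc_gsm8k - acc_gsm1k


# Toy sequence: 5 characters split into 3 tokens with made-up log-probs.
print(per_char_log_likelihood([-1.0, -0.5, -1.0], "2+2=4"))  # -0.5

# Accuracies quoted later in this article for math-shepherd-mistral-7B-RL.
print(round(accuracy_gap(74.5, 61.1), 1))  # 13.4
```

Normalizing by characters rather than tokens keeps the quantity comparable across models that use different tokenizers.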

Statistical analysis reveals a positive correlation between the performance gap and $\mathcal{L}$ under Spearman’s, Pearson’s, and Kendall’s correlation measures. The fitted slope ties each percentage point of accuracy gap to an increase in per-character log-likelihood on GSM8k, an explicit memorization signal. This link supports the interpretation that sharper performance drops on GSM1k relative to GSM8k occur in models likely to have encountered GSM8k problems in their training data.
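
To make the correlation analysis concrete, the sketch below implements Pearson, Spearman (Pearson on average ranks), and a simple Kendall tau-a in plain Python. The per-model values are invented for illustration and are not results from the paper; all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


# Hypothetical per-model values: accuracy gap (points) and per-char log-likelihood.
gaps = [0.5, 2.0, 8.0, 13.4]
log_likelihoods = [-0.60, -0.55, -0.40, -0.35]
print(spearman(gaps, log_likelihoods), kendall_tau_a(gaps, log_likelihoods))
```

Rank-based measures such as Spearman and Kendall only depend on the ordering of models, so they are less sensitive to a few models with extreme gaps than Pearson's coefficient.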

3. Evaluation Protocol and Model Coverage

Evaluation follows a fixed protocol for all models. Each GSM1k or GSM8k problem is presented with a 5-shot chain-of-thought prompt whose examples are sampled at random from the GSM8k training corpus.
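
A 5-shot prompt of this kind might be assembled as in the sketch below. The `Question:` / `Answer:` layout, the dictionary keys, and the function name `build_few_shot_prompt` are assumptions made for illustration; the exact template used in the evaluation is not reproduced here.

```python
import random


def build_few_shot_prompt(train_examples, question, k=5, seed=None):
    """Prepend k randomly sampled worked examples to the target question.

    train_examples: list of dicts with 'question' and 'answer' keys (assumed
    layout), where 'answer' holds the full chain-of-thought solution.
    """
    rng = random.Random(seed)
    shots = rng.sample(train_examples, k)  # sampling without replacement
    blocks = [f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in shots]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Placeholder training pool standing in for the GSM8k training split.
train = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
prompt = build_few_shot_prompt(train, "target?", seed=0)
print(prompt)
```

Passing a `seed` makes the sampled examples reproducible, which helps when the same prompts need to be replayed across models.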

Open-source models are run at temperature zero (greedy decoding); closed-source APIs use their default settings. Automatic answer extraction compares the final integer in each output to the gold answer, as implemented in EleutherAI’s LM Evaluation Harness.
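
The extraction-and-scoring step can be approximated with the sketch below. It is not the LM Evaluation Harness implementation; it simply takes the last integer in the output, relying on the fact that all gold answers are positive integers. The names `extract_final_integer` and `accuracy` are hypothetical.

```python
import re

# Digits with optional thousands separators; the sign is ignored because
# all gold answers are positive integers.
INTEGER = re.compile(r"\d[\d,]*")


def extract_final_integer(output):
    """Return the last integer appearing in a model output, or None."""
    matches = INTEGER.findall(output)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def accuracy(outputs, gold_answers):
    """Fraction of outputs whose extracted final integer equals the gold answer."""
    if len(outputs) != len(gold_answers) or not outputs:
        raise ValueError("need equally sized, non-empty lists")
    correct = sum(
        extract_final_integer(out) == gold
        for out, gold in zip(outputs, gold_answers)
    )
    return correct / len(outputs)


outputs = ["She sold 48+24 = 72 clips.\n#### 72", "The answer is 5."]
print(accuracy(outputs, [72, 6]))  # 0.5
```

This simple pattern would misread decimal outputs such as `3.5` as the integer `5`, so a production extractor would need stricter parsing.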

Models evaluated span both closed-source (e.g., GPT-4, GPT-4 Turbo, GPT-3.5-Turbo, Gemini 1.5, Claude 2.1, and Claude 3 variants) and open-source families (Mistral small/medium/large, Mixtral, LLaMA2/3, Phi, Yi, Gemma, CodeLlama, Pythia, GPT-NeoX-20B, GPT-2-XL, among others).

4. Results: Accuracy, Overfitting, and Family-Wide Patterns

Accuracy on GSM1k lags GSM8k for most open-source and several proprietary models. The largest recorded drop is 13.4 percentage points for math-shepherd-mistral-7B-RL (74.5% GSM8k vs. 61.1% GSM1k). Drops of 8–10 points are common across the Mistral and Phi families (e.g., Mixtral-8x22B, Phi-3-mini). In contrast, leading closed-source models (GPT-4, Claude 3, Gemini) exhibit virtually no gap.

Systematic family-level overfitting is observed for Phi (phi-1, phi-2, phi-3-mini-4K/128K), Mistral and Mixtral, and, to a lesser degree, Yi, Gemma, and CodeLlama derivatives. LLaMA2/3 and top closed-source models show no evidence of such overfitting. A positive correlation between GSM8k per-character log-likelihood and the performance gap supports the contamination hypothesis, though discrepancies (e.g., Mixtral v0.1 and v0.1-Instruct showing near-identical gaps despite disparate likelihoods) indicate that memorization is not the sole mechanism at play.

Despite overfitting signals, the affected models retain substantial capability: phi-3 (7B) drops ∼10 points from GSM8k but still reaches 68% accuracy on GSM1k, comparable to much larger models that show no overfitting. Overfitting therefore does not preclude broad reasoning capability.

5. Generalization and Robustness Experiments

GSM1k’s “guaranteed-novel” construction is validated using pre-GSM8k LLMs. GPT-2-XL and GPT-NeoX-20B, both trained entirely prior to the release of GSM8k, show near-zero accuracy gap between GSM8k and GSM1k, confirming the absence of contamination and that GSM1k provides genuinely novel challenges.

A further reserved set of fully QC-passed, held-out problems (excluded from GSM1k) is used for pilot generalization studies. Here, leading models demonstrate qualitatively similar accuracy to the main GSM1k test set, indicating that top-performing LLMs have internalized generalizable arithmetic reasoning, not merely benchmark memorization.

Across all evaluated families, performance on novel problems scales smoothly with model capability. The highest-performing frontier models consistently exceed 95% accuracy on held-out arithmetic questions, whereas lower-capacity or fine-tuned variants perform proportionally less well. This illustrates that models with sufficient intrinsic reasoning capacity generalize beyond exposed benchmarks.

6. Reproducibility, Best Practices, and Implications

Methodological conclusions drawn from GSM1k emphasize the necessity of human-only dataset curation, step-count and answer-distribution matching to prior benchmarks, and the retention of a small, novel test split for robust future evaluation. Recommended evaluation practice includes fixed n-shot chain-of-thought prompting, zero-temperature decoding, and joint reporting of raw accuracy and the accuracy gap $\Delta$ (GSM8k accuracy minus GSM1k accuracy). Overfitting analysis is strengthened by correlating accuracy differences with memorization proxies such as per-character log-likelihood.

Benchmarking should span both state-of-the-art closed-source and open-source model suites to allow contamination effects to be disambiguated from true gains in mathematical reasoning ability. GSM1k’s design and findings provide actionable guidance for researchers concerned with the integrity, validity, and future evolution of arithmetic and reasoning benchmarks in the era of LLMs (Zhang et al., 2024).


References (1)

Zhang et al. (2024). A Careful Examination of Large Language Model Performance on Grade School Arithmetic.
