GSM8K Test Pass@1 Evaluation
- GSM8K Test Pass@1 is an evaluation metric that measures the percentage of grade-school math problems a language model solves correctly in a single generation.
- It is computed by dividing the number of correctly answered problems by the total test cases, reflecting top-1 accuracy using greedy or stochastic decoding.
- Innovations such as verifier selection, SLOT optimization, and dual-architecture models enhance Pass@1 performance without requiring massive parameter scaling.
GSM8K Test Pass@1 is a widely adopted evaluation metric for assessing the mathematical problem-solving capability of LLMs on the GSM8K benchmark, a dataset composed of grade-school level word problems and corresponding stepwise solutions. Pass@1 denotes the probability that a model, when prompted once, produces an answer string that exactly matches the reference solution for each problem in the test set. As such, it reflects single-sample, top-1 accuracy under standard evaluation protocols.
1. Definition and Computation of Pass@1 on GSM8K
Pass@1 on GSM8K is formally defined as the fraction of problems for which the model's first generated answer (typically produced by greedy or stochastic decoding) matches the ground-truth solution string, usually the numerical answer found at the end of a structured derivation. Let $N$ denote the number of problems in the GSM8K test set ($N = 1{,}319$) and $C$ the number of problems answered correctly. Then:

$$\mathrm{Pass@1} = \frac{C}{N}.$$

This metric is the special case ($K = 1$) of the generalized pass@K estimator introduced by Cobbe et al. (2021) and used uniformly in recent literature. Pass@1 can also be computed from multiple samples per input, but most GSM8K evaluations employ exactly one generation per problem, emphasizing the accuracy of the model's most probable or first sample (Hu et al., 18 May 2025).
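The computation above can be sketched in a few lines of Python. The `####` delimiter follows the GSM8K reference-solution format, in which the final numeric answer appears after `#### `; the extraction helper is illustrative, not a specific harness's implementation:

```python
import re

def extract_answer(solution: str):
    """Extract the final numeric answer; GSM8K references end with '#### <number>'."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", solution)
    return match.group(1).replace(",", "") if match else None

def pass_at_1(generations, references):
    """Pass@1 = C / N: fraction of problems whose single generation
    matches the gold answer under exact string match."""
    correct = sum(
        extract_answer(gen) is not None
        and extract_answer(gen) == extract_answer(ref)
        for gen, ref in zip(generations, references)
    )
    return correct / len(references)
```

For example, `pass_at_1(["... #### 42"], ["... #### 42"])` yields `1.0`, since the single generation matches the reference answer.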
2. Benchmarks and Architectures: Reported GSM8K Pass@1 Results
Recent work has established clear baselines and breakthrough points on GSM8K Pass@1 across a range of parameter scales, dataset-synthesis strategies, and model architectures:
| Model & Configuration | Params | Decoding Format | GSM8K Pass@1 | Source |
|---|---|---|---|---|
| Llama-2 | 34B | NLP | 42.2% | (Liu et al., 2023) |
| MAMMoTH (Code-Llama) | 34B | Code | 72.7% | (Liu et al., 2023) |
| ToRA (CoT-code) | 34B | Code | 80.7% | (Liu et al., 2023) |
| WizardMath | 70B | NLP | 81.6% | (Liu et al., 2023) |
| MetaMath | 70B | NLP | 82.3% | (Liu et al., 2023) |
| TinyGSM (1.3B gen + 1.3B ver, verify₄₈@1) | 1.3B x2 | Code+Verify | 81.5% | (Liu et al., 2023) |
| Qwen2.5-7B, few-shot CoT | 7B | NLP | 57.54% | (Hu et al., 18 May 2025) |
| Qwen2.5-7B + SLOT (T=3) | 7B | NLP | 66.19% | (Hu et al., 18 May 2025) |
| GPT-2 (co-finetuned dual-arch) | 124M | Hybrid latent | 31.5% | (Coda-Forno et al., 1 Oct 2025) |
| Qwen-3 (co-finetuned dual-arch) | 0.6B | Hybrid latent | 38.6% | (Coda-Forno et al., 1 Oct 2025) |
Evaluation details, prompt formats, decoding temperatures, and selection protocols differ by work, but Pass@1 is uniformly interpreted as the single-shot accuracy metric on GSM8K test.
3. Methodological Innovations Affecting Pass@1
Three principal methodological advances inform the state-of-the-art in GSM8K Pass@1:
- Verifier Selection: TinyGSM demonstrates that using a dedicated verifier network to select the top candidate from a batch of model generations (“verify₄₈@1”) can elevate pass@1 from ∼68% (finetuned 1.3B generator alone) to 81.5%, on par with much larger LLMs. The verifier is trained to score candidate solutions at the sequence level by executing the generated code and comparing its answer to the gold label (Liu et al., 2023).
- Sample-Specific Test-Time Optimization (SLOT): Instead of conventional inference, SLOT performs a lightweight, per-prompt optimization of a vector added to the model’s final hidden layer. For Qwen2.5-7B, three steps of such adaptation raise GSM8K Pass@1 from 57.54% to 66.19%, with ablations confirming robustness for 1–5 optimization steps across a range of learning rates (Hu et al., 18 May 2025).
- Latent-Reasoning Architectures: Dual-module models with an explicit “Base” and “Coprocessor” (as in (Coda-Forno et al., 1 Oct 2025)) have been directly compared to single-model “soft-embedding” baselines. On GSM8K, pass@1 for co-finetuned dual-architecture models (H2) peaks at 31.5% at GPT-2 scale and 38.6% for Qwen-3, only marginally ahead of the soft-embedding single-model baselines (26.5% and 38.5%, respectively). There is no statistically significant improvement with an increased latent-token budget, and latent-subspace analysis suggests limited specialization, with nearly all variance shared among slots.
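The SLOT mechanism described above can be illustrated with a toy numpy sketch: a per-prompt vector `delta` added to the final hidden states is optimized for a few gradient steps to raise the likelihood of the prompt tokens, then held fixed for generation. The random `W`, `hidden`, and `targets` are stand-ins for a real model, not the Qwen2.5 architecture, and the learning rate here is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16
W = rng.normal(size=(vocab, d)) * 0.1     # stand-in output head
hidden = rng.normal(size=(8, d))          # final hidden states of 8 prompt tokens
targets = rng.integers(0, vocab, size=8)  # next-token ids within the prompt

def prompt_loss(delta):
    """Mean cross-entropy of prompt tokens with delta added to each hidden state."""
    logits = (hidden + delta) @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

delta = np.zeros(d)
lr, steps = 0.1, 3                        # T = 3 optimization steps, as in the paper
for _ in range(steps):
    logits = (hidden + delta) @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    probs[np.arange(len(targets)), targets] -= 1.0  # softmax minus one-hot
    grad = (probs @ W).mean(axis=0)                 # dL/d(delta)
    delta -= lr * grad
```

Because the loss is convex in `delta` and the update touches only a single $d$-dimensional vector, the adaptation is cheap relative to a full forward pass, which is the point of the per-sample scheme.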
4. Evaluation Protocols and Result Interpretation
Standard GSM8K Pass@1 evaluations employ the following protocol:
- Each test problem is presented as a stand-alone prompt, generally formatted for chain-of-thought or code-based solution output as appropriate for the underlying model.
- Only the first decoded answer is scored. Decoding is often greedy or uses a fixed temperature, with no re-ranking unless a verifier is present.
- In the “verify₄₈@1” protocol, 48 model samples are produced per problem, scored on the final token by an auxiliary verifier, and the top candidate selected for Pass@1 computation.
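The verify₄₈@1 selection step can be sketched as follows; `generate` and `verifier_score` stand in for the finetuned generator and trained verifier, and the `extract` convention is an assumption, not TinyGSM's exact interface:

```python
def verify_at_1(problem, generate, verifier_score, gold_answer, extract, k=48):
    """Sample k candidates, keep the one the verifier scores highest,
    and credit Pass@1 only if that single selected candidate is correct."""
    candidates = [generate(problem) for _ in range(k)]
    best = max(candidates, key=lambda cand: verifier_score(problem, cand))
    return extract(best) == gold_answer
```

Note that although $k$ samples are drawn, the metric remains Pass@1: exactly one candidate, the verifier's top choice, is scored per problem.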
Ablations on model scale, sample count, and verifier/generator configurations in TinyGSM confirm that the majority of gains derive from the use of a verifier and high-quality synthetic data generation, rather than solely from scaling the generator. In SLOT, ablations confirm that per-sample adaptation rapidly saturates, with most improvement in 1–5 optimization steps. Dual-architecture reasoning models are limited chiefly by lack of latent specialization, not by channel capacity.
5. Factors Impacting GSM8K Pass@1 and Limitations
Key factors influencing GSM8K Pass@1 include:
- Data Quality and Scale: Large, diverse, and high-quality training data—especially synthetic, code-augmented solutions—enable small models to rival or surpass much larger ones by supporting generalization and precise arithmetic (Liu et al., 2023).
- Verifier Architecture and Diversity: The scale and diversity of the verifier, plus heterogeneity in the candidate generation pool, have a strong effect on the final pass@1. Using multiple training checkpoints and sampling temperatures increases the chances of correct selection.
- Code Execution as Supervision: Models trained to emit executable Python code instead of pure text are less prone to arithmetic errors and better at producing verifiable, step-by-step derivations.
- Test-Time Adaptation: Approaches such as SLOT, which optimize a small auxiliary parameter vector per prompt, directly enhance model alignment to complex, underrepresented instructions without catastrophic forgetting.
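Checking a candidate by executing its emitted Python, as in the code-as-supervision setup above, can be sketched like this; the convention that each candidate defines a `solution()` function is an assumed generation format, not a documented interface:

```python
def execute_and_check(candidate_code: str, gold_answer: float) -> bool:
    """Run the model-emitted program in an isolated namespace and compare its
    returned value to the gold label; any runtime error counts as incorrect."""
    namespace = {}
    try:
        exec(candidate_code, namespace)   # candidate is expected to define solution()
        result = namespace["solution"]()
        return abs(float(result) - gold_answer) < 1e-6
    except Exception:
        return False
```

Real evaluation harnesses wrap this in a sandbox with timeouts and restricted builtins; the bare form is shown only to make the supervision signal concrete.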
However, several limitations persist:
- Verifier-based strategies depend on code execution and may not generalize to non-code or purely symbolic reasoning tasks.
- Dataset contamination remains a residual concern with high-volume synthetic data, even after n-gram decontamination.
- The gap between these approaches and proprietary models such as GPT-4 (reporting 97% pass@1) remains substantial on harder multi-step or compositional problems.
6. Implications for Model Design and Future Directions
Results to date on GSM8K Pass@1 underscore several broader implications:
- System scale is not a strict requirement: Parameter-efficient models, when augmented with large synthetic datasets and verifier-based output selection, can achieve pass@1 >80%—comparable to large-scale proprietary models for this domain.
- Auxiliary optimization (verifiers, SLOT) is effective: Lightweight test-time inference modifications or verification modules are often as or more effective than architectural changes or increased latent capacity.
- Latent reasoning objectives require further development: Dual-architecture and latent-token expansion strategies, as tested, do not yet capture the algorithmic planning and modularity hypothesized to underlie “System 2” reasoning. High overlap in latent subspace suggests limited specialization.
- Verifier-based selection and synthetic data are broadly applicable: The verifier paradigm and synthetic code/data generation approach are transferable to other math and logic benchmarks; they suggest a route for continued improvement without substantial scaling.
- Remaining challenges: Bridging the remaining accuracy gap on GSM8K and related datasets will require advances in dataset generation, cross-domain generalization, and more interpretable or targeted forms of per-prompt adaptation and reasoning-space modularity.
7. Comparison with Related Metrics and Benchmarks
While Pass@1 remains the principal metric due to its simplicity and operational relevance (single best-shot accuracy), many works also report pass@K (for $K > 1$), especially in the context of code and math reasoning. Notably, recent analyses identify a tension: optimization techniques (such as RLVR) that strongly favor Pass@1 may reduce pass@K for $K > 1$ due to distributional over-concentration on the top candidate (Peng et al., 16 Oct 2025). Targeted interventions, such as SimKO, have been proposed to mitigate this effect, but no GSM8K test pass@1 results are directly available from these frameworks, as GSM8K is used for training only in their evaluations.
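For reference, the unbiased combinatorial form of pass@K commonly used in such comparisons, which estimates the metric from $n$ samples of which $c$ are correct, can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c correct) is correct."""
    if n - c < k:
        return 1.0   # too few incorrect samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With $n = 1$ the estimator reduces exactly to the per-problem correctness indicator averaged in the Pass@1 definition of Section 1.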
An important consideration is that all comparative results must attend to prompt format (code vs. chain-of-thought), dataset decontamination, and sample selection/verification protocol, as these factors generate significant variance across reported pass@1 in the literature.