GSM8K Test Pass@1 Evaluation
- GSM8K Test Pass@1 is an evaluation metric that measures the percentage of grade-school math problems a language model solves correctly in a single generation.
- It is computed by dividing the number of correctly answered problems by the total test cases, reflecting top-1 accuracy using greedy or stochastic decoding.
- Innovations such as verifier selection, SLOT optimization, and dual-architecture models enhance Pass@1 performance without requiring massive parameter scaling.
GSM8K Test Pass@1 is a widely adopted evaluation metric for assessing the mathematical problem-solving capability of LLMs on the GSM8K benchmark, a dataset composed of grade-school level word problems and corresponding stepwise solutions. Pass@1 denotes the probability that a model, when prompted once, produces an answer string that exactly matches the reference solution for each problem in the test set. As such, it reflects single-sample, top-1 accuracy under standard evaluation protocols.
1. Definition and Computation of Pass@1 on GSM8K
Pass@1 on GSM8K is formally defined as the fraction of problems for which the model's first generated answer (typically produced by greedy or stochastic decoding) matches the ground-truth solution string, usually the numerical answer found at the end of a structured derivation. Let $N$ denote the number of problems in the GSM8K test set ($N = 1{,}319$) and $C$ the number of problems answered correctly. Then:

$$\mathrm{Pass@1} = \frac{C}{N}.$$

This metric is the special case ($K = 1$) of the generalized pass@K estimator introduced by Cobbe et al. (2021) and used uniformly in recent literature. Pass@1 can also be computed from multiple samples per input, but most GSM8K evaluations employ exactly one generation per problem, emphasizing the accuracy of the model's most probable or first sample (Hu et al., 18 May 2025).
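The computation above can be sketched in a few lines of Python. The `####` delimiter follows the GSM8K reference-solution format, in which the final numeric answer appears after `#### `; the extraction helper is illustrative, not a specific harness's implementation:

```python
import re

def extract_answer(solution: str):
    """Extract the final numeric answer; GSM8K references end with '#### <number>'."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", solution)
    return match.group(1).replace(",", "") if match else None

def pass_at_1(generations, references):
    """Pass@1 = C / N: fraction of problems whose single generation
    matches the gold answer under exact string match."""
    correct = sum(
        extract_answer(gen) is not None
        and extract_answer(gen) == extract_answer(ref)
        for gen, ref in zip(generations, references)
    )
    return correct / len(references)
```

For example, `pass_at_1(["... #### 42"], ["... #### 42"])` yields `1.0`, since the single generation matches the reference answer.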
2. Benchmarks and Architectures: Reported GSM8K Pass@1 Results
Recent work has established clear baselines and breakthrough points on GSM8K Pass@1 across a range of parameter scales, dataset-synthesis strategies, and model architectures:
| Model & Configuration | Params | Decoding Format | GSM8K Pass@1 | Source |
|---|---|---|---|---|
| Llama-2 | 34B | NLP | 42.2% | (Liu et al., 2023) |
| MAMMoTH (Code-Llama) | 34B | Code | 72.7% | (Liu et al., 2023) |
| ToRA (CoT-code) | 34B | Code | 80.7% | (Liu et al., 2023) |
| WizardMath | 70B | NLP | 81.6% | (Liu et al., 2023) |
| MetaMath | 70B | NLP | 82.3% | (Liu et al., 2023) |
| TinyGSM (1.3B gen + 1.3B ver, verify₄₈@1) | 1.3B x2 | Code+Verify | 81.5% | (Liu et al., 2023) |
| Qwen2.5-7B, few-shot CoT | 7B | NLP | 57.54% | (Hu et al., 18 May 2025) |
| Qwen2.5-7B + SLOT (T=3) | 7B | NLP | 66.19% | (Hu et al., 18 May 2025) |
| GPT-2 (co-finetuned dual-arch) | 124M | Hybrid latent | 31.5% | (Coda-Forno et al., 1 Oct 2025) |
| Qwen-3 (co-finetuned dual-arch) | 0.6B | Hybrid latent | 38.6% | (Coda-Forno et al., 1 Oct 2025) |
Evaluation details, prompt formats, decoding temperatures, and selection protocols differ by work, but Pass@1 is uniformly interpreted as the single-shot accuracy metric on GSM8K test.
3. Methodological Innovations Affecting Pass@1
Three principal methodological advances inform the state-of-the-art in GSM8K Pass@1:
- Verifier Selection: TinyGSM demonstrates that using a dedicated verifier network to select the top candidate from a batch of model generations (“verify₄₈@1”) can elevate pass@1 from ∼68% (finetuned 1.3B generator alone) to 81.5%, on par with much larger LLMs. The verifier is trained to score candidate solutions at the sequence level by executing the generated code and comparing its answer to the gold label (Liu et al., 2023).
- Sample-Specific Test-Time Optimization (SLOT): Instead of conventional inference, SLOT performs a lightweight, per-prompt optimization of a vector added to the model’s final hidden layer. For Qwen2.5-7B, three steps of such adaptation raise GSM8K Pass@1 from 57.54% to 66.19%, with ablations confirming robustness for 1–5 optimization steps across a range of learning rates (Hu et al., 18 May 2025).
- Latent-Reasoning Architectures: Dual-module models with an explicit “Base” and “Coprocessor” (as in (Coda-Forno et al., 1 Oct 2025)) have been directly compared to single-model “soft-embedding” baselines. On GSM8K, pass@1 for co-finetuned dual-architecture models (H2) peaks at 31.5% at GPT-2 scale and 38.6% for Qwen-3, only marginally ahead of the soft-embedding single-model baselines (26.5% and 38.5%, respectively). There is no statistically significant improvement with an increased latent-token budget, and latent-subspace analysis suggests limited specialization, with nearly all variance shared among slots.
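The SLOT mechanism described above can be illustrated with a toy numpy sketch: a per-prompt vector `delta` added to the final hidden states is optimized for a few gradient steps to raise the likelihood of the prompt tokens, then held fixed for generation. The random `W`, `hidden`, and `targets` are stand-ins for a real model, not the Qwen2.5 architecture, and the learning rate here is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 16
W = rng.normal(size=(vocab, d)) * 0.1     # stand-in output head
hidden = rng.normal(size=(8, d))          # final hidden states of 8 prompt tokens
targets = rng.integers(0, vocab, size=8)  # next-token ids within the prompt

def prompt_loss(delta):
    """Mean cross-entropy of prompt tokens with delta added to each hidden state."""
    logits = (hidden + delta) @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

delta = np.zeros(d)
lr, steps = 0.1, 3                        # T = 3 optimization steps, as in the paper
for _ in range(steps):
    logits = (hidden + delta) @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    probs[np.arange(len(targets)), targets] -= 1.0  # softmax minus one-hot
    grad = (probs @ W).mean(axis=0)                 # dL/d(delta)
    delta -= lr * grad
```

Because the loss is convex in `delta` and the update touches only a single $d$-dimensional vector, the adaptation is cheap relative to a full forward pass, which is the point of the per-sample scheme.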
4. Evaluation Protocols and Result Interpretation
Standard GSM8K Pass@1 evaluations employ the following protocol:
- Each test problem is presented as a stand-alone prompt, generally formatted for chain-of-thought or code-based solution output as appropriate for the underlying model.
- Only the first decoded answer is scored. Decoding is often greedy or uses a fixed temperature, with no re-ranking unless a verifier is present.
- In the “verify₄₈@1” protocol, 48 model samples are produced per problem, scored on the final token by an auxiliary verifier, and the top candidate selected for Pass@1 computation.
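The verify₄₈@1 selection step can be sketched as follows; `generate` and `verifier_score` stand in for the finetuned generator and trained verifier, and the `extract` convention is an assumption, not TinyGSM's exact interface:

```python
def verify_at_1(problem, generate, verifier_score, gold_answer, extract, k=48):
    """Sample k candidates, keep the one the verifier scores highest,
    and credit Pass@1 only if that single selected candidate is correct."""
    candidates = [generate(problem) for _ in range(k)]
    best = max(candidates, key=lambda cand: verifier_score(problem, cand))
    return extract(best) == gold_answer
```

Note that although $k$ samples are drawn, the metric remains Pass@1: exactly one candidate, the verifier's top choice, is scored per problem.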
Ablations on model scale, sample count, and verifier/generator configurations in TinyGSM confirm that the majority of gains derive from the use of a verifier and high-quality synthetic data generation, rather than solely from scaling the generator. In SLOT, ablations confirm that per-sample adaptation rapidly saturates, with most improvement in 1–5 optimization steps. Dual-architecture reasoning models are limited chiefly by lack of latent specialization, not by channel capacity.
5. Factors Impacting GSM8K Pass@1 and Limitations
Key factors influencing GSM8K Pass@1 include:
- Data Quality and Scale: Large, diverse, and high-quality training data—especially synthetic, code-augmented solutions—enable small models to rival or surpass much larger ones by supporting generalization and precise arithmetic (Liu et al., 2023).
- Verifier Architecture and Diversity: The scale and diversity of the verifier, plus heterogeneity in the candidate generation pool, have a strong effect on the final pass@1. Using multiple training checkpoints and sampling temperatures increases the chances of correct selection.
- Code Execution as Supervision: Models trained to emit executable Python code instead of pure text are less prone to arithmetic errors and better at producing verifiable, step-by-step derivations.
- Test-Time Adaptation: Approaches such as SLOT, which optimize a small auxiliary parameter vector per prompt, directly enhance model alignment to complex, underrepresented instructions without catastrophic forgetting.
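Checking a candidate by executing its emitted Python, as in the code-as-supervision setup above, can be sketched like this; the convention that each candidate defines a `solution()` function is an assumed generation format, not a documented interface:

```python
def execute_and_check(candidate_code: str, gold_answer: float) -> bool:
    """Run the model-emitted program in an isolated namespace and compare its
    returned value to the gold label; any runtime error counts as incorrect."""
    namespace = {}
    try:
        exec(candidate_code, namespace)   # candidate is expected to define solution()
        result = namespace["solution"]()
        return abs(float(result) - gold_answer) < 1e-6
    except Exception:
        return False
```

Real evaluation harnesses wrap this in a sandbox with timeouts and restricted builtins; the bare form is shown only to make the supervision signal concrete.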
However, several limitations persist:
- Verifier-based strategies depend on code execution and may not generalize to non-code or purely symbolic reasoning tasks.
- Dataset contamination remains a residual concern with high-volume synthetic data, even after n-gram decontamination.
- The gap between these approaches and proprietary models such as GPT-4 (reporting 97% pass@1) remains substantial on harder multi-step or compositional problems.
6. Implications for Model Design and Future Directions
Results to date on GSM8K Pass@1 underscore several broader implications:
- System scale is not a strict requirement: Parameter-efficient models, when augmented with large synthetic datasets and verifier-based output selection, can achieve pass@1 >80%—comparable to large-scale proprietary models for this domain.
- Auxiliary optimization (verifiers, SLOT) is effective: Lightweight test-time inference modifications or verification modules are often as or more effective than architectural changes or increased latent capacity.
- Latent reasoning objectives require further development: Dual-architecture and latent-token expansion strategies, as tested, do not yet capture the algorithmic planning and modularity hypothesized to underlie “System 2” reasoning. High overlap in latent subspace suggests limited specialization.
- Verifier-based selection and synthetic data are broadly applicable: The verifier paradigm and synthetic code/data generation approach are transferable to other math and logic benchmarks; they suggest a route for continued improvement without substantial scaling.
- Remaining challenges: Bridging the remaining accuracy gap on GSM8K and related datasets will require advances in dataset generation, cross-domain generalization, and more interpretable or targeted forms of per-prompt adaptation and reasoning-space modularity.
7. Comparison with Related Metrics and Benchmarks
While Pass@1 remains the principal metric due to its simplicity and operational relevance (single best-shot accuracy), many works also report pass@K (for $K > 1$), especially in the context of code and math reasoning. Notably, recent analyses identify a tension: optimization techniques (such as RLVR) that strongly favor Pass@1 may reduce pass@K for $K > 1$ due to distributional over-concentration on the top candidate (Peng et al., 16 Oct 2025). Targeted interventions, such as SimKO, have been proposed to mitigate this effect, but no GSM8K test pass@1 results are directly available from these frameworks, as GSM8K is used for training only in their evaluations.
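For reference, the unbiased combinatorial form of pass@K commonly used in such comparisons, which estimates the metric from $n$ samples of which $c$ are correct, can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c correct) is correct."""
    if n - c < k:
        return 1.0   # too few incorrect samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With $n = 1$ the estimator reduces exactly to the per-problem correctness indicator averaged in the Pass@1 definition of Section 1.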
An important consideration is that all comparative results must attend to prompt format (code vs. chain-of-thought), dataset decontamination, and sample selection/verification protocol, as these factors generate significant variance across reported pass@1 in the literature.