Pass@1 Score Overview
- The Pass@1 score quantifies the probability that a single model-generated solution passes all verification checks (e.g., the full test suite) for a given problem.
- It is computed by averaging binary verifier outcomes over sampled candidate outputs, making it directly interpretable, though the verification step is computationally intensive.
- While it provides a clear performance signal in large-scale evaluations, Pass@1 suffers from high variance in low-sample regimes and, when used as a training objective, gives little incentive for exploration.
The Pass@1 score quantifies the probability that a single model-generated solution to a problem—often in code synthesis or mathematical reasoning—is fully correct according to a deterministic verifier or test suite. This metric is foundational for evaluating models in reinforcement learning with verifiable rewards (RLVR), code generation, and LLM reasoning settings. It is mathematically grounded as a special case of the more general pass@k metric, where pass@k measures the probability that at least one out of k independent samples is correct.
1. Formal Definition and Theoretical Foundation
Let $\pi_\theta$ denote a policy (typically an LLM or a code synthesizer) parameterized by $\theta$, evaluated on an input prompt $x$. The single-sample correctness probability (Pass@1) is

$$\text{pass@1}(x) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\mathbf{1}\{y \text{ is correct}\}\big] \;=\; p_x,$$

where “correct” is determined by an external verifier or an all-encompassing test suite. Pass@k generalizes this to $k$ i.i.d. samples:

$$\text{pass@}k(x) \;=\; 1 - (1 - p_x)^k.$$

For $k = 1$, pass@k reduces exactly to pass@1.

In code synthesis benchmarks, if $n$ candidates are generated for each benchmark problem and $c$ candidates pass all tests on a given problem, then the standard unbiased per-problem estimator is

$$\widehat{\text{pass@}k} \;=\; 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}, \qquad \widehat{\text{pass@1}} \;=\; \frac{c}{n}.$$
This concise formulation allows for empirical estimation via the observed fraction of correct samples (Yang et al., 11 Jun 2024, Walder et al., 21 May 2025).
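For reference, the following minimal Python sketch implements the numerically stable product form of this per-problem estimator (assuming $n$ generated samples, $c$ of which pass all tests); for $k = 1$ it reduces to the observed fraction $c/n$.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimator 1 - C(n-c, k) / C(n, k), with n samples and c correct."""
    if n - c < k:
        return 1.0
    # Stable product form: 1 - prod_{i=n-c+1}^{n} (1 - k / i)
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# For k = 1 the estimator is simply the observed fraction of correct samples:
assert abs(pass_at_k(n=20, c=5, k=1) - 5 / 20) < 1e-12
```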
2. Computation in Practice
In RLVR and code generation tasks, Pass@1 computation involves:
- Sample Generation: For each prompt or problem, produce $n$ candidate outputs via an LLM under fixed sampling parameters.
- Verification: Each output is subjected to all test cases. Only fully correct outputs count as “passing” (binary outcome).
- Estimator: The empirical per-problem pass rate $c/n$ is averaged over all problems in the test set.
This protocol is computationally demanding due to the need for thousands of code compilations or executions per evaluation cycle (Yang et al., 11 Jun 2024, Walder et al., 21 May 2025).
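The protocol can be summarized in a short Python sketch; `generate` and `passes_all_tests` below are hypothetical stand-ins for the model's sampling interface and the benchmark's test harness, not a specific library API.

```python
from typing import Callable, Sequence

def evaluate_pass1(
    problems: Sequence[str],
    generate: Callable[[str, int], list[str]],     # hypothetical: n sampled solutions for a prompt
    passes_all_tests: Callable[[str, str], bool],  # hypothetical: runs the problem's full test suite
    n: int = 16,
) -> float:
    """Average the empirical per-problem pass rate c/n over the benchmark."""
    per_problem = []
    for prompt in problems:
        candidates = generate(prompt, n)
        c = sum(passes_all_tests(prompt, cand) for cand in candidates)
        per_problem.append(c / n)
    return sum(per_problem) / len(per_problem)
```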
3. Pass@1 in Policy Optimization and Exploration
Pass@1 underpins most RLVR objective functions. In policy optimization, it is targeted via standard policy-gradient estimators (REINFORCE, PPO). The unbiased estimator for the gradient is

$$\nabla_\theta \, \text{pass@1}(x) \;\approx\; \frac{1}{n} \sum_{i=1}^{n} r_i \, \nabla_\theta \log \pi_\theta(y_i \mid x),$$

where $r_i \in \{0, 1\}$ is the binary correctness of the $i$-th sample $y_i \sim \pi_\theta(\cdot \mid x)$.
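To make the estimator concrete, here is a minimal numpy sketch on a toy categorical policy over five discrete answers (not an LLM), assuming a single prompt, a binary verifier, and plain gradient ascent with learning rate 1.0; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(5)   # softmax logits over 5 candidate answers; answer 3 is "correct"
CORRECT = 3

def pass1_gradient(theta: np.ndarray, n: int = 64) -> np.ndarray:
    """REINFORCE estimate (1/n) * sum_i r_i * grad log pi(y_i)."""
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    ys = rng.choice(len(theta), size=n, p=probs)
    rewards = (ys == CORRECT).astype(float)        # binary verifier outcome r_i
    grads = -probs[None, :].repeat(n, axis=0)      # grad log pi(y) = one_hot(y) - probs
    grads[np.arange(n), ys] += 1.0
    return (rewards[:, None] * grads).mean(axis=0)

for _ in range(200):
    theta += 1.0 * pass1_gradient(theta)           # gradient ascent on pass@1
```

Because the reward is binary and only one answer is correct, the policy quickly concentrates almost all of its mass on that single mode, a small-scale picture of the exploration collapse discussed next.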
Though reliable for optimizing deterministic accuracy, pure pass@1 optimization focuses on single-mode exploitation. This can cause the model to collapse to low-entropy (highly deterministic) policies, impairing discovery of alternative correct responses. Such “exploration collapse” leaves the model susceptible to local minima and reduced solution diversity (Yu, 20 Nov 2025, Chen et al., 14 Aug 2025, Walder et al., 21 May 2025).
4. Benefits, Limitations, and Robustness
Pass@1 possesses several strengths:
- Direct Interpretability: The score gives the expected success probability for a user allowed only one model sample per problem (Yang et al., 11 Jun 2024).
- Oracle Grounding: It directly reflects functional (semantic) correctness as defined by the test suite.
- Stability in Large Regimes: With sufficiently many samples per problem and a sufficiently large problem set, it is a stable summary of model performance.
However, its limitations are nontrivial:
- Variance in Small Sample Regimes: Single-draw Bernoulli variance yields unstable rankings unless many samples or tasks are evaluated (Hariri et al., 5 Oct 2025); see the simulation sketch after this list.
- No Uncertainty Quantification: Traditional Pass@1 does not yield credible intervals or principled decision rules for model comparison.
- Cost: Test execution is computationally intensive, especially with extensive benchmarks.
- Missed Coverage: It does not reward diversity or mode coverage—correct but rare solutions are systematically undercounted (Yu, 20 Nov 2025).
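To illustrate the small-sample instability, the following sketch simulates two hypothetical models with assumed true single-sample success rates of 0.42 and 0.40 and counts how often a one-sample-per-problem evaluation over 100 problems ranks the weaker model above the stronger one.

```python
import numpy as np

rng = np.random.default_rng(1)
p_a, p_b = 0.42, 0.40              # assumed true success rates of two hypothetical models
n_problems, n_samples = 100, 1     # single-draw (pass@1-style) evaluation

flips = 0
for _ in range(1000):
    a = rng.binomial(1, p_a, size=(n_problems, n_samples)).mean()
    b = rng.binomial(1, p_b, size=(n_problems, n_samples)).mean()
    flips += a < b                  # the weaker model "wins" this evaluation run
print(f"ranking flipped in {flips / 10:.1f}% of runs")
```

Under these assumptions the ranking flips in a substantial fraction of runs, which is why credible intervals or larger sample budgets are advisable before drawing conclusions from small Pass@1 gaps.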
5. Role in Ranking, Evaluation, and Comparison
Pass@1 is the de facto metric for single-shot evaluation in code generation, mathematical reasoning, and RLVR. In code ranking contexts (e.g., with Top Pass (Lyu et al., 11 Aug 2024)), the primary objective is to maximize Pass@1 by optimizing the ranking function so that at least one correct candidate surfaces in the top position. Modifying standard classifiers to directly maximize Pass@1, rather than simple classification loss, can substantially enhance top-1 accuracy of generated solutions in practical systems.
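As an illustration of the difference between a pointwise classification loss and a top-1-oriented objective, the following PyTorch sketch penalizes a candidate ranker only when its best-scoring correct candidate fails to outrank the incorrect ones; this is a generic hinge surrogate, not necessarily the exact Top Pass loss.

```python
import torch

def top1_ranking_loss(scores: torch.Tensor, correct: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Hinge surrogate: only the best-scoring correct candidate must beat every incorrect one,
    mirroring the Pass@1 criterion that a single correct solution reaches the top position."""
    pos, neg = scores[correct.bool()], scores[~correct.bool()]
    if pos.numel() == 0 or neg.numel() == 0:
        return scores.new_zeros(())   # no learning signal without both kinds of candidates
    return torch.relu(margin - (pos.max() - neg)).mean()

# Example: four candidates for one problem, two of which pass all tests.
scores = torch.tensor([0.2, 1.5, -0.3, 0.9], requires_grad=True)
correct = torch.tensor([0, 1, 0, 1])
top1_ranking_loss(scores, correct).backward()
```

A pointwise classification loss, by contrast, weights every candidate equally and is not specifically rewarded for pushing at least one correct candidate into the top slot.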
In Bayesian frameworks, the posterior mean under a uniform Dirichlet prior is an affine transformation of Pass@1, making them order-equivalent for ranking purposes while providing robust credible intervals and improved sample efficiency (Hariri et al., 5 Oct 2025). This Bayesian replacement of Pass@1 enables more reliable model comparison, especially in compute-constrained or low-sample settings.
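A minimal sketch of the per-problem posterior, assuming binary outcomes and the uniform Beta(1, 1) prior (the binary case of the uniform Dirichlet prior); the exact aggregation protocol of Hariri et al. may differ.

```python
from scipy.stats import beta

n, c = 32, 9                         # illustrative: n samples, c verified-correct
pass1 = c / n                        # empirical Pass@1
posterior = beta(c + 1, n - c + 1)   # uniform Beta(1, 1) prior + binomial likelihood

post_mean = posterior.mean()         # (c + 1) / (n + 2) = (n * pass1 + 1) / (n + 2): affine in Pass@1
lo, hi = posterior.interval(0.95)    # equal-tailed 95% credible interval
print(f"pass@1={pass1:.3f}  posterior mean={post_mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```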
6. Pass@1 vs. Pass@k and the Exploration-Exploitation Dilemma
Though Pass@k (for $k > 1$) is intuitively attractive for measuring performance over multiple samples, its optimization does not introduce fundamentally new learning directions; rather, it is a per-example positive reweighting of the Pass@1 gradient:

$$\nabla_\theta \, \text{pass@}k(x) \;=\; k\,(1 - p_x)^{k-1} \, \nabla_\theta \, \text{pass@1}(x),$$

with scaling factor $k\,(1 - p_x)^{k-1} > 0$. Pass@k’s signal vanishes when exploration is most needed (prompts with very low $p_x$, where almost every sampled reward is zero), and the gap between pass@k and pass@1 collapses as the model overfits its highest-probability solution. Therefore, maximizing pass@1 (or pass@k) alone is mathematically insufficient for promoting exploration, motivating explicit entropy or diversity bonuses in RLVR (Yu, 20 Nov 2025, Chen et al., 14 Aug 2025).
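A small numeric sketch, assuming the closed form pass@k = 1 - (1 - p)^k per prompt: the multiplier k(1 - p)^{k-1} shrinks toward zero as p approaches 1 (the gap collapse), while at very low p the analytic multiplier is large but the empirical gradient it rescales is itself almost always zero because sampled rewards are zero.

```python
for p in (0.01, 0.2, 0.8, 0.99):         # per-sample success probability p_x
    for k in (1, 4, 16):
        pass_k = 1 - (1 - p) ** k        # closed-form pass@k
        scale = k * (1 - p) ** (k - 1)   # multiplier applied to the pass@1 gradient
        print(f"p={p:.2f}  k={k:2d}  pass@k={pass_k:.3f}  scale={scale:.3f}")
```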
7. Contemporary Alternatives and Recommendations
Recent research recommends:
- Employing pass@1 as an evaluation diagnostic rather than a direct RL objective in exploration-critical tasks (Yu, 20 Nov 2025).
- Augmenting pass@1 rewards with entropy regularization, count-based bonuses, or max-entropy RL objectives for effective mode coverage and exploration (Yu, 20 Nov 2025, Chen et al., 14 Aug 2025).
- Using Bayesian credible-interval–based protocols to replace pass@1 reporting, yielding stable, uncertainty-aware model rankings under sample and compute constraints (Hariri et al., 5 Oct 2025).
- Leveraging surrogate metrics such as CodeScore-R to efficiently approximate pass@1 without full test suite execution when such cost is prohibitive (Yang et al., 11 Jun 2024).
Pass@1 remains central for both academic comparison and practical deployment, but optimal reasoning and code generation performance increasingly depend on the adoption of exploration-promoting objectives and robust statistical evaluation frameworks.