Pass@$k$ Inference Scaling
- Pass@$k$ inference scaling tracks the probability that at least one of $k$ generated completions is correct, a quantity that shows diminishing returns as $k$ increases.
- It employs statistical models like the Beta-Binomial framework to understand power-law decay and forecast performance improvements on reasoning benchmarks.
- Practical strategies such as dynamic sampling, entropy-aware generation, and minimax-optimal selection are recommended to balance compute efficiency with reliability.
Pass@$k$ inference scaling refers to the empirical and theoretical study of how predictive accuracy improves as the number of independent model-generated solutions per problem ($k$) is increased at inference time. The pass@$k$ metric, closely associated with "coverage," measures the probability that at least one out of $k$ generated completions is correct for a given task instance. Pass@$k$ scaling is central to reasoning benchmarks, generative code and math evaluation, and analysis of reliability under computational constraints. A comprehensive understanding of this phenomenon integrates statistical theory, algorithmic strategies, limitations, and practical guidance, culminating in a sharp characterization of the trade-offs and pitfalls underlying inference scaling for LLMs and other generative systems.
1. Mathematical Formulation of Pass@$k$ and Scaling Behavior
The core definition of pass@$k$ on a given input is
$$\text{pass@}k \;=\; 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},$$
where $n$ is the number of candidate generations and $c$ the number of correct samples among them (as per a reference solution or automated checker). For repeated sampling with replacement and independent per-sample accuracy $p_i$, the more familiar form is
$$\text{pass@}k(i) \;=\; 1 - (1 - p_i)^k.$$
Averaged over the dataset $\mathcal{D}$, the empirical estimator is
$$\widehat{\text{pass@}k} \;=\; \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}}\big[1 - (1 - p_i)^k\big].$$
This functional form yields a monotonic, concave curve in $k$, exhibiting diminishing returns as $k$ increases; specifically, the marginal gain is determined by the tail of the per-problem distribution of $p_i$. As $k \to \infty$, for discrete answer spaces, pass@$k$ converges to the fraction of tasks with nonzero single-trial accuracy ($p_i > 0$), regardless of the reliability of these probabilities (Dragoi et al., 9 Oct 2025).
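The combinatorial estimator above is usually evaluated in a numerically stable product form rather than via raw binomial coefficients; the following is a minimal sketch with illustrative counts.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def dataset_pass_at_k(counts, k: int) -> float:
    """Average the per-task estimate over (n_i, c_i) pairs for a whole dataset."""
    return float(np.mean([pass_at_k(n, c, k) for n, c in counts]))

# Illustrative counts: three tasks, 100 attempts each, with 10, 0, and 2 successes.
print(dataset_pass_at_k([(100, 10), (100, 0), (100, 2)], k=10))
```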
To capture the scaling law more generally, models such as the Beta prior framework posit a distribution $p \sim \mathrm{Beta}(\alpha, \beta)$ over per-task accuracies. The expected failure rate after $k$ samples then decays as a power law,
$$\mathbb{E}\big[(1-p)^k\big] \;=\; \frac{B(\alpha, \beta + k)}{B(\alpha, \beta)} \;\approx\; C\,k^{-\alpha} \quad \text{for large } k,$$
where the exponent $\alpha$ is fitted empirically and controls the asymptotic decay rate (Levi, 21 Oct 2024).
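A quick numerical check of this asymptotic, using illustrative (not fitted) shape parameters, compares the exact Beta expectation against the $C\,k^{-\alpha}$ approximation.

```python
import numpy as np
from scipy.special import betaln, gammaln

def expected_fail_at_k(alpha: float, beta: float, k: int) -> float:
    """Exact E_{p ~ Beta(alpha, beta)}[(1 - p)^k] = B(alpha, beta + k) / B(alpha, beta)."""
    return float(np.exp(betaln(alpha, beta + k) - betaln(alpha, beta)))

alpha, beta = 0.4, 3.0                                   # illustrative prior, not fitted values
prefactor = np.exp(gammaln(alpha + beta) - gammaln(beta))  # constant C in the power-law limit
for k in (1, 10, 100, 1000, 10000):
    print(k, expected_fail_at_k(alpha, beta, k), prefactor * k ** -alpha)
```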
2. Statistical Theory and Robust Prediction of Pass@$k$ at Scale
Forecasting pass@$k$ at large $k$ from limited samples presents significant statistical challenges. Naive power-law fits or direct extrapolation are biased due to deterministic sample dependence and heteroskedasticity. The recommended approach is to fit a Beta-Binomial model to the observed (successes, trials) pairs per task,
$$c_i \;\sim\; \mathrm{BetaBinomial}(n_i, \alpha, \beta),$$
with MLE estimation for $(\alpha, \beta)$. The predictive pass@$k$ is then computed as
$$\widehat{\text{pass@}k} \;=\; 1 - \frac{B(\hat\alpha, \hat\beta + k)}{B(\hat\alpha, \hat\beta)}.$$
Bootstrap resampling yields finite-sample confidence intervals. Dynamic sampling (allocating more attempts to the hardest problems) further reduces estimation variance (Kazdan et al., 6 Oct 2025).
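A minimal sketch of this pipeline on synthetic counts follows; it fits the population-level Beta-Binomial by Nelder-Mead MLE and bootstraps over tasks, while omitting the per-task posterior-predictive refinements and dynamic-sampling allocation of the cited work.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

def beta_binom_nll(log_params, c, n):
    """Negative log-likelihood of per-task successes c out of n trials under Beta-Binomial."""
    a, b = np.exp(log_params)  # optimize in log-space so alpha, beta stay positive
    ll = (gammaln(n + 1) - gammaln(c + 1) - gammaln(n - c + 1)
          + betaln(c + a, n - c + b) - betaln(a, b))
    return -np.sum(ll)

def predict_pass_at_k(c, n, k):
    """MLE fit of (alpha, beta), then population-level pass@k = 1 - B(a, b + k) / B(a, b)."""
    res = minimize(beta_binom_nll, x0=np.log([1.0, 1.0]), args=(c, n), method="Nelder-Mead")
    a, b = np.exp(res.x)
    return 1.0 - float(np.exp(betaln(a, b + k) - betaln(a, b)))

# Synthetic data: 50 tasks, 20 attempts each, heterogeneous true per-task accuracies.
rng = np.random.default_rng(0)
n = np.full(50, 20)
c = rng.binomial(n, rng.beta(0.5, 2.0, size=50))

point = predict_pass_at_k(c, n, k=1000)
# Bootstrap over tasks for a finite-sample confidence interval on the forecast.
boots = [predict_pass_at_k(c[idx], n[idx], k=1000)
         for idx in (rng.integers(0, len(c), len(c)) for _ in range(200))]
print(point, np.percentile(boots, [2.5, 97.5]))
```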
It follows that for large $k$, pass@$k$ will approach 1 for models with even minuscule per-sample accuracy $p_i$ on many problems, a degeneracy that reveals fundamental limitations of pass@$k$ as a metric of robust reasoning.
3. Algorithmic Strategies for Inference Scaling
3.1 Reinforcement Learning with Verifiable Rewards and Pass@$k$ Optimization
Conventional RL with verifiable rewards (RLVR) and PPO-style updates incentivize high probability on the top-1 response (over-concentration), leading to improved pass@$1$ but degraded pass@$k$ for $k > 1$ (Peng et al., 16 Oct 2025). SimKO (Simple Pass@$K$ Optimization) modifies the importance ratio in RLVR by applying (i) top-$K$ probability boosts to verified-correct responses at high-entropy tokens and (ii) steeper penalties to overconfident top-1 tokens in incorrect responses.
This asymmetric smoothing lifts the entire pass@$k$ curve, with especially pronounced gains at large $k$ (+4.4 pp on math benchmarks), and does so without sacrificing output fluency.
Pass@$k$ policy optimization (PKPO) directly targets the gradient of pass@$k$, using unbiased, efficient estimators for both binary and continuous rewards and further reducing variance through refined baselines. Empirically, annealing $k$ during training enables concurrent optimization of pass@$1$ and pass@$k$ (Walder et al., 21 May 2025).
3.2 Efficient Inference-Time Scaling Methods
Sampling-based scaling can be optimized through several methods:
- MatryoshkaThinking recursively combines generation, self-verification, and summarization, efficiently retaining correct subtraces while reducing token costs by over 20× relative to deep ensemble methods, with nearly identical pass@$k$ outcomes (Chen et al., 11 Oct 2025).
- Entropy-Aware Generation (EAGer) adaptively branches at high-entropy decision points in the output sequence, allocating the sample budget dynamically; this yields equivalent or superior pass@$k$ with up to 65% fewer tokens than full parallel sampling (Scalena et al., 13 Oct 2025). A toy branching sketch appears after this list.
- Diversified Sampling (DivSampling) draws completions from perturbed prompts or problem augmentations, increasing sample diversity; under mild conditions, the failure probability decreases strictly faster than for repeated sampling on a static prompt, especially benefiting harder tasks and higher $k$ (Wang et al., 16 Feb 2025).
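To make the entropy-aware branching idea concrete, here is a toy sketch: `step_fn` is a hypothetical stand-in for a language model's next-token distribution, and the threshold, budget, and two-way branching rule are illustrative rather than the EAGer authors' settings.

```python
import numpy as np

def token_entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_aware_sample(step_fn, vocab, budget, tau=1.0, max_len=16, seed=0):
    """Branch the beam only at high-entropy steps, until the token budget is spent.
    step_fn(prefix) must return a probability vector over `vocab` (toy stand-in here)."""
    rng = np.random.default_rng(seed)
    beams, finished, used = [[]], [], 0
    while beams and used < budget:
        prefix = beams.pop()
        if len(prefix) >= max_len:
            finished.append(prefix)
            continue
        p = step_fn(prefix)
        used += 1
        if token_entropy(p) > tau and used + 1 < budget:
            # Uncertain step: sample two distinct continuations from one forward pass.
            picks = rng.choice(len(vocab), size=2, replace=False, p=p)
        else:
            picks = [rng.choice(len(vocab), p=p)]
        beams.extend(prefix + [vocab[int(i)]] for i in picks)
    return finished + beams

# Toy "model": a flat distribution, so every step counts as high-entropy and branches.
vocab = list("abcd")
flat = lambda prefix: np.full(len(vocab), 1.0 / len(vocab))
completions = entropy_aware_sample(flat, vocab, budget=40)
print(len(completions), ["".join(c) for c in completions[:3]])
```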
Efficient algorithmic strategies also exist for batch completion: superposed decoding produces $k$ completions in approximately constant time per token, as opposed to the linear-in-$k$ cost of running $k$ independent decoding passes (Shen et al., 28 May 2024).
3.3 Minimax-Optimal Selection
Neither majority voting (selecting the most frequent answer) nor Best-of-$N$ (selecting the top-reward answer) is scaling-monotonic or minimax optimal for pass@$k$. Best-of-Majority (BoM), which restricts selection to high-frequency outputs before applying the reward model, is minimax-optimal, ensuring that regret decays as the sampling budget grows while avoiding reward hacking at large $N$ (Di et al., 3 Oct 2025).
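A minimal sketch of the frequency-filter-then-reward idea follows; the threshold, fallback rule, and per-answer reward aggregation are illustrative choices, not the paper's exact procedure.

```python
from collections import Counter

def best_of_majority(answers, rewards, min_freq_frac=0.3):
    """Keep only answers whose empirical frequency among the N samples clears a threshold,
    then return the highest-reward survivor (falling back to plain majority voting)."""
    n = len(answers)
    freq = Counter(answers)
    frequent = [a for a in freq if freq[a] / n >= min_freq_frac]
    if not frequent:
        return freq.most_common(1)[0][0]
    best_reward = {a: max(r for ans, r in zip(answers, rewards) if ans == a) for a in frequent}
    return max(frequent, key=best_reward.get)

answers = ["42", "42", "41", "42", "17", "17", "41"]
rewards = [0.70, 0.80, 0.95, 0.60, 0.90, 0.20, 0.30]
# "41" carries the single highest reward but is too rare to survive the frequency filter,
# so an outlier sample cannot hack the reward model.
print(best_of_majority(answers, rewards))  # -> "42"
```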
4. Reliability, Breadth-Depth Trade-offs, and Metric Limitations
Pass@$k$ at large $k$ tends to overstate model capability by capturing "random guessing" rather than genuine reliability. For discrete answer spaces, pass@$k \to 1$ as $k \to \infty$ as soon as $p_i > 0$ for a given problem. To explicitly distinguish between breadth (number of reachably solvable tasks) and depth (reliability/consistency), Cover@$\tau$ is introduced: it tracks the fraction of tasks solved with per-sample reliability at least $\tau$,
$$\text{Cover@}\tau \;=\; \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}} \mathbf{1}\{p_i \ge \tau\}.$$
Pass@$k$ is a Beta-weighted average of Cover@$\tau$ over thresholds $\tau$, with mass concentrated at low $\tau$ as $k$ increases, making it a poor proxy for depth. Leaderboard rankings shift fundamentally when Cover metrics are used: exploration-promoting methods can achieve broader and more reliable coverage than pure exploiters (Dragoi et al., 9 Oct 2025).
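The breadth/depth gap is easy to see numerically; the per-task accuracies below are hypothetical, and the function names are only for this illustration.

```python
import numpy as np

def cover_at_tau(p_hat, tau):
    """Cover@tau: fraction of tasks whose estimated per-sample accuracy is at least tau."""
    return float(np.mean(np.asarray(p_hat) >= tau))

def pass_at_k_from_p(p_hat, k):
    """Independent-sampling pass@k: mean over tasks of 1 - (1 - p_i)^k."""
    return float(np.mean(1.0 - (1.0 - np.asarray(p_hat)) ** k))

# Hypothetical per-task accuracies: a few reliably solved tasks, many barely solvable ones.
p_hat = np.array([0.9, 0.8, 0.02, 0.01, 0.01, 0.005, 0.0])
print(pass_at_k_from_p(p_hat, k=1000))  # high: near-guessing tasks still count as "solved"
print(cover_at_tau(p_hat, tau=0.5))     # low: only the genuinely reliable tasks count
```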
5. Inference Scaling in Continuous Space Reasoning
Dropout-based sampling with continuous latent trajectories (as in COCONUT) enables diverse sample generation and rising pass@$k$ with $k$. However, standard PRM and ORM reward models fail to effectively discriminate correct from incorrect continuous "thought spaces" due to geometric and dynamic homogeneity. Key metrics (IsoScore*, Hoyer sparsity, trajectory curvature) show minimal separation between correct and incorrect trajectories, and even small Gaussian perturbations yield only a slight accuracy drop. This suggests the need for training-time inductive biases (e.g., isotropy regularization, contrastive trajectory objectives) specifically tailored to continuous latent spaces to enable effective pass@$k$ scaling and reranking (Wang et al., 14 Oct 2025).
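Two of the cited geometry metrics are simple to compute; the sketch below uses random trajectories as stand-ins for COCONUT latent rollouts and omits IsoScore*, so the values are purely illustrative.

```python
import numpy as np

def hoyer_sparsity(x):
    """Hoyer sparsity of a vector: 1 for a one-hot vector, 0 for a perfectly flat one."""
    x = np.abs(np.asarray(x, dtype=float))
    n = x.size
    return float((np.sqrt(n) - x.sum() / np.linalg.norm(x)) / (np.sqrt(n) - 1))

def mean_turning_angle(traj):
    """Average angle (radians) between consecutive steps of a latent trajectory
    (rows = successive continuous 'thought' vectors); a crude curvature proxy."""
    steps = np.diff(np.asarray(traj, dtype=float), axis=0)
    cos = [np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
           for a, b in zip(steps[:-1], steps[1:])]
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical 8-step, 64-dimensional latent trajectory.
rng = np.random.default_rng(0)
traj = rng.normal(size=(8, 64)).cumsum(axis=0)
print(hoyer_sparsity(traj[-1]), mean_turning_angle(traj))
```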
6. False Positives and Ultimate Scaling Limits
Inference scaling methods (Best-of-$N$, self-consistency, tree search) can only increase the chance of obtaining an answer that passes automated checks; they do not mitigate the prevalence of "false positives", outputs whose final answers are correct but whose reasoning is flawed. Empirically, even at large $k$, the gap between automated pass@$k$ and human-validated accuracy remains roughly 20–30 percentage points, and the scaling exponent for manually verified accuracy is about half that for automated checking. This ceiling is due to the persistence of structural reasoning errors across samples, which raw sampling cannot overcome. Thus, inference scaling is fundamentally limited by the model's deductive weaknesses, not just by search coverage (Wang et al., 10 Feb 2025).
7. Practical Recommendations and Cost Trade-offs
- To maximize pass@$k$ per unit of computational cost, employ dynamic or entropy-aware sampling, sample diversity methods, and minimax-optimal selection schemes (BoM).
- For applications with reliability requirements, evaluate models on Cover@$\tau$ at application-relevant thresholds $\tau$.
- In large-scale systems, robust estimation of rare-event scaling (e.g., exploit/jailbreak rates) must use Beta-Binomial fitting and dynamic sampling to minimize risk estimation error at large $k$, rather than relying on pure curve extrapolation (Kazdan et al., 6 Oct 2025).
- Sampling budgets beyond moderate $k$ see sharply diminishing returns, and further increases often overstate practical capability due to false positives and unreliable outputs.
- Hardware-efficient architectures (e.g., Mamba distilled reasoners) combined with optimized selection can surpass more accurate but slower models under fixed compute or latency budgets (Paliotta et al., 27 Feb 2025).
8. Theoretical and Empirical Integration with Neural Scaling Laws
The power-law form of the expected failure rate in $k$ mirrors neural scaling laws for model and data scaling, and can be directly connected to inference-phase resource allocation: total inference compute scales as
$$C_{\text{inf}} \;\propto\; k\,(n_p + n_d),$$
where $C_{\text{inf}}$ is total inference compute, $n_p$ the number of prompt tokens, and $n_d$ the number of decode tokens per sample, so the power-law decay in $k$ translates directly into a power law in compute. This framework enables global optimization over training and inference compute investments for desired accuracy targets (Levi, 21 Oct 2024).
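As a worked illustration of this coupling, with made-up fitted constants rather than values from the cited paper, one can invert the power law for the sampling budget needed to hit a target failure rate and then price it in tokens.

```python
import math

def samples_for_target_fail(A: float, alpha: float, target_fail: float) -> int:
    """Invert E[fail] ~ A * k^(-alpha) to get the sampling budget k for a target failure rate."""
    return math.ceil((A / target_fail) ** (1.0 / alpha))

def inference_cost(k: int, prompt_tokens: int, decode_tokens: int) -> int:
    """Total tokens processed at inference, assuming cost scales as k * (n_p + n_d)."""
    return k * (prompt_tokens + decode_tokens)

A, alpha = 0.6, 0.45  # hypothetical fitted power-law constants
k = samples_for_target_fail(A, alpha, target_fail=0.05)
print(k, inference_cost(k, prompt_tokens=512, decode_tokens=1024))
```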
This synthesis establishes pass@$k$ inference scaling as a multidimensional phenomenon governed by model statistics, sampling and selection strategies, metric limitations, and application-specific trade-offs, with substantial implications for both capability evaluation and responsible deployment of LLMs.