PASS@(k, T) Framework Evaluation

Updated 17 April 2026

The paper introduces PASS@(k, T) as a metric that generalizes Pass@k by jointly measuring sampling breadth (k) and interaction depth (T) for LLMs.
It employs probabilistic models and unbiased estimators to differentiate efficiency gains from genuine capability expansions in diverse applications.
The framework supports agentic tool use, code generation, and RL by providing rigorous targets for optimizing multi-stage tasks.

The PASS@ $(k, T)$ framework extends the classical Pass@ $k$ metric for evaluating LLMs and agentic AI systems by introducing a two-dimensional, user-centric measure of success under constrained sampling and interaction resources. While Pass@ $k$ traditionally estimates the probability that at least one of $k$ sampled outputs solves a problem, PASS@ $(k, T)$ generalizes this by jointly parameterizing: (1) the number of independent trajectories, samples, or candidate solutions ( $k$ ), and (2) the number of interaction rounds, environment steps, or input perturbations ( $T$ ) permitted per attempt. This framework enables precise quantification of both the efficiency with which a model can solve tasks via sampling (breadth) and its capability to leverage more complex, sequential, or compositional reasoning through deeper interaction (depth). As such, PASS@ $(k, T)$ provides a principled basis for evaluating, optimizing, and analyzing LLMs, especially in settings involving tool use, agentic planning, or meta-variation over task instances (Dalal et al., 19 May 2025, Zhai et al., 16 Apr 2026, Lyu et al., 2024).

1. Formal Definition and Core Metric

PASS@ $(k, T)$ is defined as the probability that, after drawing $k$ independent trajectories, each of which may use up to $T$ rounds of interaction with the environment, tool API, or variant generation pipeline, at least one trajectory succeeds. Formally, for a given task $x$ and agent policy $\pi$, one has:

$$\mathrm{PASS@}(k, T)(x) = \Pr_{\tau_1, \ldots, \tau_k \sim \pi_T(\cdot \mid x)} \left[ \exists\, i \in \{1, \ldots, k\} : \mathbb{1}_{\mathrm{succ}}(\tau_i) = 1 \right]$$

where each $\tau_i$ is a trajectory (solution attempt) generated under policy $\pi$ with $T$ allowed steps, and $\mathbb{1}_{\mathrm{succ}}$ is an indicator for task success (Zhai et al., 16 Apr 2026). In empirical studies, PASS@ $(k, T)$ is often estimated using $n \ge k$ rollouts as:

$$\widehat{\mathrm{PASS@}}(k, T) = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

with $c$ the number of successful trajectories observed. The metric interpolates between static reasoning (no multi-round interaction, recovering classic Pass@ $k$) and agentic multi-step problem solving (larger $T$).
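As a worked instance of the combinatorial estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ at a fixed depth $T$, consider hypothetical counts (chosen only for illustration, not taken from any cited study) of $n = 10$ rollouts, $c = 3$ successes, and $k = 2$:

```latex
\widehat{\mathrm{PASS@}}(2, T)
  = 1 - \frac{\binom{10-3}{2}}{\binom{10}{2}}
  = 1 - \frac{21}{45}
  = \frac{8}{15} \approx 0.533
```

The subtracted ratio is the chance that a uniformly random size-$k$ subset of the $n$ rollouts contains no success, which is why the estimate rises with either more successes $c$ or a larger $k$.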

2. Motivations and Theoretical Foundations

The introduction of the $T$ parameter addresses critical limitations of prior evaluation schemes. In classic code generation and math reasoning tasks, Pass@ $k$ captures only sampling reliability, that is, how likely it is to obtain a valid solution by increasing the sample count $k$, and cannot differentiate between models that merely sample more diversely and those that genuinely solve harder or deeper problems via interaction (Dalal et al., 19 May 2025, Lyu et al., 2024). By incorporating $T$, PASS@ $(k, T)$ enables rigorous disambiguation of:

Efficiency Gains (sampling axis $k$): How rapidly does adding more attempts increase success probability?
Capability Boundary Expansion (interaction axis $T$): Can the model solve problems that are otherwise unattainable without deeper interaction or multiple reasoning/tool-use rounds?

The framework formalizes these axes, allowing empirical separation of improvements due to increased search breadth versus genuine expansion of the solvable problem space. For example, in static tasks, RL often merely reallocates probability mass within existing strategies, leading to pass-rate convergence as $k$ grows; in contrast, on compositional or multi-stage tasks, increasing $T$ can unlock new capabilities not accessible through resampling alone, and RL can expand the boundary of solvable queries (Zhai et al., 16 Apr 2026).
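The difference between the two axes can be illustrated with a minimal sketch, assuming i.i.d. trajectories whose per-trajectory success probability depends only on the depth $T$. The function name `pass_at_k_T` and every probability below are hypothetical, chosen for illustration rather than taken from the cited papers.

```python
def pass_at_k_T(p_success_by_depth: dict[int, float], k: int, T: int) -> float:
    """Population PASS@(k, T) for k i.i.d. trajectories at depth T.

    Assumes each trajectory succeeds independently with probability
    p_success_by_depth[T] (a simplifying assumption, not a measured value).
    """
    p = p_success_by_depth[T]
    return 1.0 - (1.0 - p) ** k


# Hypothetical tasks: one is solvable without depth, one is not.
static_like = {1: 0.20, 4: 0.25}    # depth adds little
compositional = {1: 0.00, 4: 0.10}  # unsolvable at T = 1

for name, task in (("static-like", static_like), ("compositional", compositional)):
    for T in (1, 4):
        curve = [round(pass_at_k_T(task, k, T), 3) for k in (1, 8, 64)]
        print(f"{name:14s} T={T}  PASS@(k,T) for k=1,8,64: {curve}")
```

For the compositional task the curve at $T = 1$ stays at zero for every $k$, so no amount of resampling helps, while at $T = 4$ it climbs toward one. For the static-like task both depths converge as $k$ grows, which is the efficiency-only pattern.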

3. Probabilistic Models of Inconsistency and Variants

PASS@ $(k, T)$ is leveraged not only for interaction depth but also for modeling intra-task variation, e.g., generating $T$ semantically equivalent variants of a challenge and soliciting solutions from each ("Variator" agents) (Dalal et al., 19 May 2025). A representative probabilistic model assumes the original task has per-sample success rate $p$, and each generated variant $i$ exhibits clipped uniform fluctuation:

$$p_i = \mathrm{clip}(p + \epsilon_i,\, 0,\, 1), \qquad \epsilon_i \sim \mathrm{Uniform}(-\delta, \delta)$$

The expected variant pass rate $\bar{p} = \mathbb{E}[p_i]$ is then a piecewise function of $p$ and $\delta$. Under i.i.d. variant draws, one obtains:

$$\mathrm{PASS@}(k, T) = 1 - \prod_{i=1}^{T} \mathbb{E}\left[(1 - p_i)^{k_i}\right]$$

where $k_i$ solutions are submitted per variant $i$ ( $\sum_{i=1}^{T} k_i = k$ ). For $k_i = 1$ and $T = k$, this simplifies to $1 - (1 - \bar{p})^k$ (Dalal et al., 19 May 2025).
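A minimal numerical sketch of the clipped-uniform variant model follows, assuming one solution per variant and a fluctuation half-width called `delta` here. The function names, the midpoint-rule integration, and the example values are illustrative assumptions, not the cited authors' implementation.

```python
def expected_variant_pass_rate(p: float, delta: float, grid: int = 20_000) -> float:
    """Mean of clip(p + eps, 0, 1) for eps ~ Uniform(-delta, delta).

    Computed with a deterministic midpoint rule over `grid` cells.
    """
    total = 0.0
    for j in range(grid):
        eps = -delta + 2.0 * delta * (j + 0.5) / grid
        total += min(1.0, max(0.0, p + eps))
    return total / grid


def pass_with_variants(p: float, delta: float, k: int) -> float:
    """k i.i.d. variants, one solution each: 1 - (1 - p_bar)^k."""
    p_bar = expected_variant_pass_rate(p, delta)
    return 1.0 - (1.0 - p_bar) ** k


def pass_with_repeats(p: float, k: int) -> float:
    """k repeated samples of the original task: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical hard task: p is below delta, so clipping at 0 lifts the mean.
print(expected_variant_pass_rate(0.02, 0.1))
print(pass_with_variants(0.02, 0.1, 10), pass_with_repeats(0.02, 10))
# Easier task: no clipping occurs, so the mean is unchanged.
print(expected_variant_pass_rate(0.30, 0.1))
```

Under these assumptions the variant strategy only helps when clipping is active (here, when $p < \delta$), because the lower clip at zero raises the mean success rate above $p$. When no clipping occurs the mean equals $p$ and, with one solution per variant, the two strategies coincide.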

This model yields two core theoretical guarantees:

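Exponential Lower Bound: with one solution per variant and expected variant pass rate $\bar{p}$, $\mathrm{PASS@}(k, T) \ge 1 - e^{-k \bar{p}}$.
Regret Bound: The difference between optimized variant-based and repeat-based strategies vanishes exponentially with $k$.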

These results clarify that, by structuring sampling around the natural inconsistency of LLM responses, aggregate performance on Pass@ $k$ can be robustly and predictably increased.

4. Empirical Estimation and Optimization Strategies

In practice, PASS@ $(k, T)$ can be estimated or optimized using:

Hypergeometric estimation for tasks with finite rollouts or candidate pools (Zhai et al., 16 Apr 2026, Lyu et al., 2024).
Unbiased estimators for full or partial groupings, as in code evaluation or RLVR. For binary rewards and $n \ge k$ samples, the estimator is

$$1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

with $c$ the count of correct solutions (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).

Direct policy optimization: Using tailored per-sample reward transformations, such as those in Pass-at- $k$ Policy Optimization (PKPO), which admit analytic, unbiased, and low-variance gradient estimators for both binary and continuous reward settings (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025).
Surrogate losses for ranking: In the Top Pass framework, training objectives directly maximize PASS@ $k$ via pairwise or groupwise surrogate losses on candidate rankings, hard example mining, and joint cross-entropy stabilization (Lyu et al., 2024).
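The unbiasedness of the binary-reward estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ can be checked exactly in a small sketch. It assumes the $n$ samples are i.i.d. with success probability $p$, so that $c$ is binomially distributed; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> Fraction:
    """Combinatorial pass@k estimate from n samples with c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k samples failed.
    return 1 - Fraction(comb(n - c, k), comb(n, k))


def expected_estimate(n: int, k: int, p: Fraction) -> Fraction:
    """Exact expectation of the estimate when c ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * pass_at_k_estimate(n, c, k)
        for c in range(n + 1)
    )


p = Fraction(1, 4)  # hypothetical per-sample success probability
print(expected_estimate(10, 3, p), 1 - (1 - p) ** 3)
```

Using exact rational arithmetic, the expectation over $c$ matches $1 - (1-p)^k$ with no rounding error, which is the property that makes the estimator preferable to the plug-in value $1 - (1 - c/n)^k$.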

Joint tuning of $k$ and $T$ thus allows practitioners to balance exploration (diversity and breadth) and exploitation (precision and depth).
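One way to picture this trade-off is a small budget-allocation sketch. It rests on assumptions not made in the source: the total cost of a configuration is taken to be $k \cdot T$ interaction rounds, trajectories are i.i.d., and the per-trajectory success rates at each depth are hypothetical placeholders that would in practice come from pilot rollouts.

```python
def best_allocation(p_by_depth: dict[int, float], budget: int) -> tuple[int, int, float]:
    """Pick (k, T) maximizing 1 - (1 - p_T)^k subject to k * T <= budget.

    p_by_depth maps a depth T to an assumed per-trajectory success rate.
    Returns (k, T, predicted PASS@(k, T)).
    """
    best = None
    for T, p in p_by_depth.items():
        k = budget // T  # spend the whole budget at this depth
        if k < 1:
            continue
        score = 1.0 - (1.0 - p) ** k
        if best is None or score > best[2]:
            best = (k, T, score)
    if best is None:
        raise ValueError("budget is too small for every listed depth")
    return best


# Hypothetical success rates that saturate with depth.
rates = {1: 0.05, 2: 0.15, 4: 0.40, 8: 0.50}
print(best_allocation(rates, budget=16))
```

With these placeholder numbers the intermediate depth wins: going deeper than $T = 4$ buys little extra per-trajectory success while halving the number of attempts. A different cost model or different rates would shift the optimum.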

5. Applications: Agentic Tool Use, Code Generation, and RLVR

PASS@ $(k, T)$ has become a standard in domains where candidate generation and solution verification are decoupled, including:

Agentic tool-using LLMs: Evaluating systems where the number of retrievals, tool calls, or API steps per sample (i.e., $T$) is variable, such as multi-hop search, compositional reasoning, or planning (Zhai et al., 16 Apr 2026). Only via PASS@ $(k, T)$ can one empirically distinguish RL-driven capability expansion (divergence of pass-curves at large $k$ and $T$) from mere improvement in sampling efficiency.
Code generation systems: Where a model typically samples $n$ candidate programs, potentially ranks them, and presents the top $k$ to the user; the PASS@ $k$ metric quantifies the expected probability that at least one correct solution appears among the $k$ user-inspected candidates (Lyu et al., 2024).
Reinforcement Learning with Verifiable Rewards (RLVR): RL algorithms can optimize for desired points on the pass@ $k$ curve, enabling new strategies for balancing solution diversity and sample quality; differentiated reward shaping further facilitates annealing between exploration and exploitation during training (Walder et al., 21 May 2025, Chen et al., 14 Aug 2025, Thrampoulidis et al., 27 Oct 2025).
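For the code generation case, the effect of ranking on what the user sees can be sketched as follows. The ranker scores, correctness labels, and function names are hypothetical; this is not the Top Pass training procedure, only the evaluation-time bookkeeping for a single task.

```python
from math import comb


def ranked_pass_at_k(scores: list[float], is_correct: list[bool], k: int) -> bool:
    """True if any of the k highest-scored candidates is correct."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return any(is_correct[i] for i in order[:k])


def unranked_pass_at_k(is_correct: list[bool], k: int) -> float:
    """Chance that a uniformly random k-subset contains a correct candidate."""
    n, c = len(is_correct), sum(is_correct)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Five hypothetical candidates for one task; only the third is correct.
scores = [0.9, 0.2, 0.7, 0.4, 0.1]
is_correct = [False, False, True, False, False]

print(ranked_pass_at_k(scores, is_correct, k=1))
print(ranked_pass_at_k(scores, is_correct, k=2))
print(unranked_pass_at_k(is_correct, k=2))
```

In this toy case the ranker places the correct program second, so showing the top two candidates always succeeds, whereas showing two at random succeeds less than half the time. Averaging `ranked_pass_at_k` over tasks gives the ranked pass rate that a ranking objective aims to raise.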

6. Theoretical Properties, Bounds, and Limitations

PASS@ $(k, T)$ subsumes various lower-dimensional metrics and admits strong theoretical analysis:

In the static setting without multi-round interaction, it reduces to classical Pass@ $k$.
For $T \to \infty$, it quantifies the full potential of the agent under unlimited depth.
Combinatorial closed forms and variance-reduction baselines are available for efficient estimation and optimization (Dalal et al., 19 May 2025, Walder et al., 21 May 2025, Chen et al., 14 Aug 2025, Lyu et al., 2024).
Bounds: Exponential convergence of pass-rate to 1 as $k$ and $T$ increase, provided the model places nonzero measure on correct solutions.
Scaling monotonicity: New inference strategies, such as Best-of-Majority (BoM), achieve minimax-optimal regret rates as the sampling budget increases and are provably robust to the choice of $k$ and sampling budget (Di et al., 3 Oct 2025).
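The exponential convergence along the sampling axis follows from an elementary inequality. As a sketch, assume $k$ i.i.d. trajectories at depth $T$, each succeeding with probability $p_T > 0$ (the symbol $p_T$ is introduced here for illustration):

```latex
\mathrm{PASS@}(k, T) = 1 - (1 - p_T)^k \;\ge\; 1 - e^{-k\, p_T},
\qquad \text{since } 1 - x \le e^{-x} \text{ for all real } x .
```

The failure probability is therefore at most $e^{-k p_T}$ and shrinks geometrically in $k$. When $p_T = 0$ the bound is vacuous, which matches the requirement that the model place nonzero measure on correct solutions.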

Limitations:

All theoretical guarantees are contingent on model assumptions (i.i.d. sampling, symmetric variation) and (where relevant) correctness of variant/equivalence mappings.
Computational load: The cost of generating and evaluating multiple variants or deep interaction trajectories can be significant (Dalal et al., 19 May 2025).
The framework applies where verifiable, automatic correctness evaluation is possible.

7. Implications, Extensions, and Future Directions

The adoption of PASS@ $(k, T)$ enables granular, multidimensional analysis of LLMs and agentic systems:

Disentangling reliability (efficiency) and capability boundary expansion at evaluation time. For static tasks, RL only improves sampling efficiency, but for sequential or compositional tasks, increases in $T$ are essential to realize true capability growth (Zhai et al., 16 Apr 2026).
Structured exploitation of model inconsistency can yield provable aggregate gains—even in the regime of highly capable, low-variance frontier models—suggesting persistent relevance for future generations (Dalal et al., 19 May 2025).
Facilitates automated curriculum learning and hard-example mining by exposing the regions of the $(k, T)$ parameter space where models are most uncertain or exhibit sharp transitions in capability (Lyu et al., 2024).
Provides a rigorous target for the design of new RL algorithms, advantage shaping schemes, and inference protocols optimized for joint sample/interaction constraints (Walder et al., 21 May 2025, Thrampoulidis et al., 27 Oct 2025, Di et al., 3 Oct 2025).
Ongoing directions include automated equivalence testing for task variants, adaptive sample allocation, and further integration with Bayesian model evaluation frameworks to address variance and uncertainty quantification (Hariri et al., 5 Oct 2025).

By formalizing the dual resource axes of sample breadth and interaction depth, PASS@ $(k, T)$ constitutes a foundational metric for the evaluation and optimization of LLM-driven reasoning, code, and agentic systems.