CoT-Pass@K: Evaluating Logical Reasoning in LLMs
- The paper shows that CoT-Pass@K extends traditional Pass@K by enforcing both correct answers and validated logical reasoning within complete chains-of-thought.
- It employs automated LLM-based verification protocols to assess deterministic answer correctness alongside multi-pass chain validation, ensuring robust evaluations.
- Empirical outcomes highlight significant diagnostic differences, demonstrating that RLVR improves reasoning fidelity in ways that classical answer-only evaluation metrics fail to reveal.
The CoT-Pass@K metric is a quantitative measure designed to evaluate reasoning models, especially LLMs, on tasks that demand both accurate outputs and verifiable chains of reasoning. Unlike the classical Pass@K, which is limited to answer correctness among K samples, CoT-Pass@K additionally enforces correctness and logical integrity in the entire reasoning trace ("chain-of-thought", or CoT). This metric arises from recent critiques that simple answer-based evaluation can overstate model competence by overlooking spurious or incomplete reasoning paths. CoT-Pass@K has become a principal diagnostic in reinforcement learning with verifiable rewards (RLVR), algorithmic reasoning benchmarks, and emerging graph-guided reasoning suites.
1. Formal Definition and Mathematical Structure
Consider a model that, for a single prompt $q$, outputs $G$ independent responses, each comprising a chain-of-thought $c_i$ and a final answer $a_i$. For each response, two indicators are assigned:
- $A_i = 1$ iff the final answer $a_i$ is correct; $0$ otherwise.
- $B_i = 1$ iff the chain $c_i$ is logically accurate and complete; $0$ otherwise.
Define $D = \sum_{i=1}^{G} A_i B_i$ as the number of samples passing both criteria. The per-prompt CoT-Pass@K is:

$$\text{CoT-Pass@}K_q \;=\; 1 - \frac{\binom{G-D}{K}}{\binom{G}{K}}.$$

Aggregating over a test set $Q$ of prompts yields:

$$\text{CoT-Pass@}K \;=\; \frac{1}{|Q|} \sum_{q \in Q} \left[\, 1 - \frac{\binom{G-D_q}{K}}{\binom{G}{K}} \,\right].$$

This combinatorial estimator is identical in structure to classical Pass@K, except that $D$ counts only samples whose reasoning is verifiably complete and correct.
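As a minimal numerical sketch (the counts below are hypothetical, chosen only to illustrate the formula), the same batch of samples can score very differently under Pass@K and CoT-Pass@K when many correct answers lack valid chains:

```python
from math import comb

# Hypothetical counts for one prompt:
# G sampled responses, C with a correct final answer,
# D with both a correct answer and a verified chain-of-thought.
G, C, D, K = 16, 10, 4, 4

pass_at_k = 1 - comb(G - C, K) / comb(G, K)       # classical Pass@K
cot_pass_at_k = 1 - comb(G - D, K) / comb(G, K)   # CoT-Pass@K

print(f"Pass@{K}     = {pass_at_k:.3f}")      # ~0.992
print(f"CoT-Pass@{K} = {cot_pass_at_k:.3f}")  # ~0.728
```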
2. Algorithmic Computation and Practical Procedures
Practical evaluation at scale relies on automated verification of chain correctness, commonly with LLM-as-verifier protocols. Each response is subjected to deterministic answer-checking and one or more passes of a strong LLM verifier on the chain-of-thought, with aggregation ("any-correct," "majority-correct," or "strict-all-correct") to reduce false judgments.
A typical evaluation loop for a fixed prompt $q$ with $G$ samples:
```python
from math import comb

responses = sample_G_responses(model, q)   # draw G independent samples for prompt q
G = len(responses)
C = 0  # samples with a correct final answer
D = 0  # samples with a correct answer AND a verified chain-of-thought
for y in responses:
    c, a = extract_CoT_and_answer(y)
    ans_correct = verify_answer(a)
    CoT_correct = verify_CoT_multiple(c)   # e.g., LLM judge with voting
    C += int(ans_correct)
    D += int(ans_correct and CoT_correct)

for K in Ks:
    PassK_q    = 1 - comb(G - C, K) / comb(G, K)
    CoTPassK_q = 1 - comb(G - D, K) / comb(G, K)
```
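A minimal sketch of how the multi-pass chain verification above could aggregate verifier votes; `judge` stands for a hypothetical callable (e.g., a strong LLM verifier behind an API) and is not part of any specific library:

```python
def verify_CoT_multiple(cot, judge, n_passes=3, scheme="majority"):
    """Aggregate several independent verifier judgments of one chain-of-thought.

    `judge` is any callable that takes the chain text and returns True if it
    deems the chain logically valid (hypothetical LLM-as-verifier call).
    """
    votes = [bool(judge(cot)) for _ in range(n_passes)]
    if scheme == "any":        # "any-correct": most lenient
        return any(votes)
    if scheme == "majority":   # "majority-correct"
        return sum(votes) > n_passes // 2
    if scheme == "strict":     # "strict-all-correct": most conservative
        return all(votes)
    raise ValueError(f"unknown aggregation scheme: {scheme}")
```

Stricter aggregation lowers $D$ and hence CoT-Pass@K; this sensitivity to the voting scheme is one of the caveats discussed in Section 6.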
Variants incorporate logical closure metrics such as those in DAG-Math (Zhang et al., 19 Oct 2025), where only chains covering a valid sub-DAG of the reference proof graph are counted as logically closed.
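The closure criterion itself is defined in DAG-Math; as a rough, hypothetical sketch of what such a check could look like, assuming the chain's steps have already been parsed into dependency edges and a reference proof DAG is available (both non-trivial assumptions, and the actual DAG-Math procedure may differ):

```python
def is_logically_closed(chain_edges, reference_edges, premises, goal):
    """Decide whether a parsed chain-of-thought is 'logically closed'.

    chain_edges / reference_edges: sets of (supporting_step, derived_step) pairs.
    premises: steps taken as given; goal: the final statement to be derived.
    """
    # 1. Every inference the chain makes must be licensed by the reference DAG.
    if not set(chain_edges) <= set(reference_edges):
        return False
    # 2. The goal must be derivable from the premises: a step counts as derived
    #    once all of its stated supports (within the chain) are derived.
    derived = set(premises)
    changed = True
    while changed:
        changed = False
        for node in {dst for _, dst in chain_edges}:
            if node in derived:
                continue
            supports = {src for src, dst in chain_edges if dst == node}
            if supports <= derived:
                derived.add(node)
                changed = True
    return goal in derived
```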
3. Theoretical Motivation and Correctness Incentives
Empirical observations showed that RLVR-trained LLMs improve Pass@1 but may stall or regress on Pass@K for large $K$, leading to the “no new reasoning” paradox. Analysis shows that classical Pass@K credits any correct answer, including those produced by incorrect or fortuitous chains. CoT-Pass@K, by contrast, demands logical integrity; it reveals that RLVR uniquely incentivizes the generalization of correct reasoning, not just answer production.
Theoretical results (see Theorem 1 of (Wen et al., 17 Jun 2025)) formalize this mechanism. Under policy-gradient updates with verifiable rewards, the expected advantage of a response with a correct CoT is positive, whereas that of a response with an incorrect CoT is negative:

$$\mathbb{E}\big[\hat{A} \mid \text{CoT correct}\big] \;>\; 0 \;>\; \mathbb{E}\big[\hat{A} \mid \text{CoT incorrect}\big],$$

where the magnitudes are governed by $\rho$, the fraction of correct CoTs among correct answers, and by the conditional probabilities of producing a correct answer given a correct or incorrect chain. This drives policy updates to favor logically correct chains.
4. Comparison with Classical Pass@K and Related Metrics
Classical Pass@K measures, for a sample of $G$ outputs, the probability that at least one is correct. In the context of chain-of-thought sampling, this may be formalized as

$$\text{Pass@}K_q \;=\; 1 - \frac{\binom{G-C}{K}}{\binom{G}{K}},$$

where $C$ is the number of samples with a correct answer, regardless of chain validity. CoT-Pass@K replaces $C$ with $D$, imposing the chain-of-thought correctness constraint.
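A direct consequence of the definitions: since $D \le C$ for every prompt, $\binom{G-D}{K} \ge \binom{G-C}{K}$, and hence

$$\text{CoT-Pass@}K \;\le\; \text{Pass@}K$$

for all $K$; the gap quantifies correct answers that are not supported by a fully valid chain.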
Graph-guided frameworks (DAG-Math (Zhang et al., 19 Oct 2025)) extend CoT-Pass@K to logical closure, requiring sampled reasoning chains not only to terminate with the correct answer but to traverse a sub-DAG consistent with problem rules. The empirical gap between answer-only accuracy and logically closed accuracy often exceeds 30 percentage points, indicating that correct answers unaccompanied by fully valid chains are widespread.
5. Empirical Outcomes and Diagnostic Power
Recent experiments on math and reasoning benchmarks consistently reveal that RLVR-induced improvements are visible primarily through CoT-Pass@K. On AIME 2024/25, for instance, CoT-Pass@K shows consistent gains from RLVR training across values of $K$, where classical Pass@K shows little separation, with analogous disparities on related benchmarks.
The fraction of correct CoTs among correct answers ($\rho$) rises early in RLVR training, and the improvement generalizes out of sample. On easier or memorization-prone datasets, this gain may be compressed.
Logical closure metrics introduced in DAG-Math reveal that models with superficially high answer accuracy may nevertheless lack consistent, rule-governed reasoning, as shown by sharp AUC drops as closure constraints are made stricter.
6. Limitations and Interpretive Cautions
Computing CoT-Pass@K at scale relies on automated verification and binomial combinatorics. LLM-based CoT judges may misclassify complex or ambiguous reasoning, and the metric is sensitive to the definition of logical correctness, to the voting scheme used for chain validation, and to the choice of ground-truth reference chains. In evaluation protocols that incorporate variants (e.g., (Dalal et al., 19 May 2025)), CoT-Pass@K can be amplified by diverse chains arising from prompt inconsistency, and such gains are meaningful only if chain equivalence and correctness are reliably checked.
When logical closure requirements are relaxed, CoT-Pass@K approaches classical Pass@K, and the diagnostic power correspondingly weakens.
7. Implications for Future Research and Evaluation
CoT-Pass@K serves as an essential metric for diagnosing true reasoning ability in generative models, complementing outcome-only measures. Its adoption clarifies that methods like RLVR and PKPO (Walder et al., 21 May 2025) incentivize genuine logical improvement, not just answer finding. By integrating structure-based constraints (e.g., DAG logical closure), researchers are better equipped to characterize reasoning fidelity, distinguish search-based performance from principled inference, and study the development of reasoning under joint exploration-exploitation objectives (Chen et al., 14 Aug 2025).
A plausible implication is that future benchmark construction and RL algorithms will increasingly rely on CoT-Pass@K and its structural refinements to advance the granularity and reliability of machine reasoning evaluation.