CoT-Pass@K: Evaluating Logical Reasoning in LLMs
- The paper shows that CoT-Pass@K extends traditional Pass@K by enforcing both correct answers and validated logical reasoning within complete chains-of-thought.
- It employs automated LLM-based verification protocols to assess deterministic answer correctness alongside multi-pass chain validation, ensuring robust evaluations.
- Empirical outcomes highlight significant diagnostic differences, demonstrating that RLVR improves reasoning fidelity in ways that classical answer-only evaluation metrics fail to reveal.
The CoT-Pass@K metric is a quantitative measure designed to evaluate reasoning models, especially LLMs, on tasks that demand both accurate outputs and verifiable chains of reasoning. Unlike the classical Pass@K, which is limited to answer correctness among K samples, CoT-Pass@K additionally enforces correctness and logical integrity in the entire reasoning trace ("chain-of-thought", or CoT). This metric arises from recent critiques that simple answer-based evaluation can overstate model competence by overlooking spurious or incomplete reasoning paths. CoT-Pass@K has become a principal diagnostic in reinforcement learning with verifiable rewards (RLVR), algorithmic reasoning benchmarks, and emerging graph-guided reasoning suites.
1. Formal Definition and Mathematical Structure
Consider a model that, for a single prompt $q$, outputs $G$ independent responses, each comprising a chain-of-thought $c_i$ and a final answer $a_i$. For each response, two indicators are assigned:
- $A_i = 1$ iff the final answer $a_i$ is correct; $0$ otherwise.
- $B_i = 1$ iff the chain $c_i$ is logically accurate and complete; $0$ otherwise.
Define $D = \sum_{i=1}^{G} A_i B_i$ as the number of samples passing both criteria. The per-prompt CoT-Pass@K is:

$$\text{CoT-Pass@}K_q \;=\; 1 - \frac{\binom{G-D}{K}}{\binom{G}{K}}.$$

Aggregating over a test set $Q$ of prompts yields:

$$\text{CoT-Pass@}K \;=\; \frac{1}{|Q|} \sum_{q \in Q} \left[\, 1 - \frac{\binom{G-D_q}{K}}{\binom{G}{K}} \,\right].$$

This combinatorial estimator is identical in structure to classical Pass@K, except that $D$ counts only samples whose reasoning is verifiably complete and correct.
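As a minimal numerical sketch (the counts below are hypothetical, chosen only to illustrate the formula), the same batch of samples can score very differently under Pass@K and CoT-Pass@K when many correct answers lack valid chains:

```python
from math import comb

# Hypothetical counts for one prompt:
# G sampled responses, C with a correct final answer,
# D with both a correct answer and a verified chain-of-thought.
G, C, D, K = 16, 10, 4, 4

pass_at_k = 1 - comb(G - C, K) / comb(G, K)       # classical Pass@K
cot_pass_at_k = 1 - comb(G - D, K) / comb(G, K)   # CoT-Pass@K

print(f"Pass@{K}     = {pass_at_k:.3f}")      # ~0.992
print(f"CoT-Pass@{K} = {cot_pass_at_k:.3f}")  # ~0.728
```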
2. Algorithmic Computation and Practical Procedures
Practical evaluation at scale relies on automated verification of chain correctness, commonly with LLM-as-verifier protocols. Each response is subjected to deterministic answer-checking and one or more passes of a strong LLM verifier on the chain-of-thought, with aggregation ("any-correct," "majority-correct," or "strict-all-correct") to reduce false judgments.
A typical evaluation loop for a fixed prompt $q$ with $G$ samples:
```python
from math import comb

responses = sample_G_responses(model, q)   # draw G independent samples for prompt q
G = len(responses)
C = 0  # samples with a correct final answer
D = 0  # samples with a correct answer AND a verified chain-of-thought
for y in responses:
    c, a = extract_CoT_and_answer(y)
    ans_correct = verify_answer(a)
    CoT_correct = verify_CoT_multiple(c)   # e.g., LLM judge with voting
    C += int(ans_correct)
    D += int(ans_correct and CoT_correct)

for K in Ks:
    PassK_q    = 1 - comb(G - C, K) / comb(G, K)
    CoTPassK_q = 1 - comb(G - D, K) / comb(G, K)
```
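A minimal sketch of how the multi-pass chain verification above could aggregate verifier votes; `judge` stands for a hypothetical callable (e.g., a strong LLM verifier behind an API) and is not part of any specific library:

```python
def verify_CoT_multiple(cot, judge, n_passes=3, scheme="majority"):
    """Aggregate several independent verifier judgments of one chain-of-thought.

    `judge` is any callable that takes the chain text and returns True if it
    deems the chain logically valid (hypothetical LLM-as-verifier call).
    """
    votes = [bool(judge(cot)) for _ in range(n_passes)]
    if scheme == "any":        # "any-correct": most lenient
        return any(votes)
    if scheme == "majority":   # "majority-correct"
        return sum(votes) > n_passes // 2
    if scheme == "strict":     # "strict-all-correct": most conservative
        return all(votes)
    raise ValueError(f"unknown aggregation scheme: {scheme}")
```

Stricter aggregation lowers $D$ and hence CoT-Pass@K; this sensitivity to the voting scheme is one of the caveats discussed in Section 6.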
Variants incorporate logical closure metrics such as those in DAG-Math (Zhang et al., 19 Oct 2025), where only chains covering a valid sub-DAG of the reference proof graph are counted as logically closed.
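The closure criterion itself is defined in DAG-Math; as a rough, hypothetical sketch of what such a check could look like, assuming the chain's steps have already been parsed into dependency edges and a reference proof DAG is available (both non-trivial assumptions, and the actual DAG-Math procedure may differ):

```python
def is_logically_closed(chain_edges, reference_edges, premises, goal):
    """Decide whether a parsed chain-of-thought is 'logically closed'.

    chain_edges / reference_edges: sets of (supporting_step, derived_step) pairs.
    premises: steps taken as given; goal: the final statement to be derived.
    """
    # 1. Every inference the chain makes must be licensed by the reference DAG.
    if not set(chain_edges) <= set(reference_edges):
        return False
    # 2. The goal must be derivable from the premises: a step counts as derived
    #    once all of its stated supports (within the chain) are derived.
    derived = set(premises)
    changed = True
    while changed:
        changed = False
        for node in {dst for _, dst in chain_edges}:
            if node in derived:
                continue
            supports = {src for src, dst in chain_edges if dst == node}
            if supports <= derived:
                derived.add(node)
                changed = True
    return goal in derived
```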
3. Theoretical Motivation and Correctness Incentives
Empirical observations showed that RLVR-trained LLMs improve Pass@1 but may stall or regress on Pass@K for large $K$, leading to the “no new reasoning” paradox. Analysis shows that classical Pass@K credits any correct answer, including those produced by incorrect or fortuitous chains. CoT-Pass@K, by contrast, demands logical integrity; it reveals that RLVR uniquely incentivizes the generalization of correct reasoning, not just answer production.
Theoretical results (see Theorem 1 of (Wen et al., 17 Jun 2025)) formalize this mechanism. Under policy-gradient updates with verifiable rewards, the expected advantage of a response with a correct CoT is positive, whereas that of a response with an incorrect CoT is negative:

$$\mathbb{E}\big[\hat{A} \mid \text{CoT correct}\big] \;>\; 0 \;>\; \mathbb{E}\big[\hat{A} \mid \text{CoT incorrect}\big],$$

where the magnitudes are governed by $\rho$, the fraction of correct CoTs among correct answers, and by the conditional probabilities of producing a correct answer given a correct or incorrect chain. This drives policy updates to favor logically correct chains.
4. Comparison with Classical Pass@K and Related Metrics
Classical Pass@K measures, for a sample of $G$ outputs, the probability that at least one is correct. In the context of chain-of-thought sampling, this may be formalized as

$$\text{Pass@}K_q \;=\; 1 - \frac{\binom{G-C}{K}}{\binom{G}{K}},$$

where $C$ is the number of samples with a correct answer, regardless of chain validity. CoT-Pass@K replaces $C$ with $D$, imposing the chain-of-thought correctness constraint.
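A direct consequence of the definitions: since $D \le C$ for every prompt, $\binom{G-D}{K} \ge \binom{G-C}{K}$, and hence

$$\text{CoT-Pass@}K \;\le\; \text{Pass@}K$$

for all $K$; the gap quantifies correct answers that are not supported by a fully valid chain.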
Graph-guided frameworks (DAG-Math (Zhang et al., 19 Oct 2025)) extend CoT-Pass@K to logical closure, requiring sampled reasoning chains not only to terminate with the correct answer but to traverse a sub-DAG consistent with problem rules. The empirical gap between answer-only accuracy and logically closed accuracy often exceeds 30 percentage points, indicating that correct answers unaccompanied by fully valid chains are widespread.
5. Empirical Outcomes and Diagnostic Power
Recent experiments on math and reasoning benchmarks consistently reveal that RLVR-induced improvements are visible primarily through CoT-Pass@K. On AIME 2024/25, for instance, CoT-Pass@K shows consistent gains from RLVR training across values of $K$, where classical Pass@K shows little separation, with analogous disparities on related benchmarks.
The fraction of correct CoTs among correct answers ($\rho$) rises early in RLVR training, and the improvement generalizes out of sample. On easier or memorization-prone datasets, this gain may be compressed.
Logical closure metrics introduced in DAG-Math reveal that models with superficially high answer accuracy may nevertheless lack consistent, rule-governed reasoning, as shown by sharp AUC drops as closure constraints are made stricter.
6. Limitations and Interpretive Cautions
Computing CoT-Pass@K at scale relies on automated verification and binomial combinatorics. LLM-based CoT judges may misclassify complex or ambiguous reasoning, and the metric is sensitive to the definition of logical correctness, to the voting scheme used for chain validation, and to the choice of ground-truth reference chains. In evaluation protocols that incorporate variants (e.g., (Dalal et al., 19 May 2025)), CoT-Pass@K can be amplified by diverse chains arising from prompt inconsistency, and such gains are meaningful only if chain equivalence and correctness are reliably checked.
When logical closure requirements are relaxed, CoT-Pass@K approaches classical Pass@K, and the diagnostic power correspondingly weakens.
7. Implications for Future Research and Evaluation
CoT-Pass@K serves as an essential metric for diagnosing true reasoning ability in generative models, complementing outcome-only measures. Its adoption clarifies that methods like RLVR and PKPO (Walder et al., 21 May 2025) incentivize genuine logical improvement, not just answer finding. By integrating structure-based constraints (e.g., DAG logical closure), researchers are better equipped to characterize reasoning fidelity, distinguish search-based performance from principled inference, and study the development of reasoning under joint exploration-exploitation objectives (Chen et al., 14 Aug 2025).
A plausible implication is that future benchmark construction and RL algorithms will increasingly rely on CoT-Pass@K and its structural refinements to advance the granularity and reliability of machine reasoning evaluation.