- The paper introduces a conformal framework to control aggregation uncertainty in chain-of-thought reasoning through score-weighted voting.
- It leverages confidence-driven abstention and closed-form accuracy prediction to ensure selective accuracy and limit confident errors in LLM outputs.
- Empirical results on GSM8K and other benchmarks demonstrate significant accuracy improvements while maintaining risk control across diverse scoring metrics.
Self-consistency improves the reliability of Chain-of-Thought (CoT) reasoning in LLMs by aggregating multiple sampled reasoning chains. Under such approaches, correctness is a property not of a single trajectory but of the aggregation rule applied to the pool of paths. The paper identifies aggregation uncertainty as the critical reliability bottleneck: a majority or consensus answer can be confidently wrong, especially when exploration or search produces distinct reasoning paths. Existing conformal prediction methods focus on token-level or single-trace calibration and do not directly address the uncertainty that arises from multi-path aggregation.
The central contribution is a fully inference-time conformal framework that generalizes self-consistency by integrating three core operations:
- Score-Weighted Aggregation: Each CoT path is assigned a quality score (reward model, judge LLM, or perplexity-based). Aggregation becomes a weighted vote, where high-quality paths exert proportionately more influence. For a set of m paths, each path is scored, and the normalized weighted sum determines both the final answer and a confidence value.
- Confidence-Driven Abstention: The system abstains on examples where the aggregated confidence falls below a calibrated threshold, mitigating confidently wrong answers. This threshold is set to limit the confident-error rate (the marginal fraction of errors among non-abstained responses) to a user-specified parameter α.
- Conformal Risk Control (CRC): Abstention threshold calibration is performed on a held-out labeled split, yielding a finite-sample guarantee on the confident-error rate under exchangeability. This calibration is agnostic to score choice and strictly inference-time.
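The first two operations above can be illustrated with a minimal sketch. It assumes non-negative path scores used directly as vote weights, and takes the confidence to be the winning answer's share of the total weight. The function names `weighted_vote` and `answer_or_abstain` are hypothetical, and ties with the threshold are treated as abstentions, which is a conservative choice made for this sketch rather than a detail taken from the paper.

```python
from collections import defaultdict


def weighted_vote(answers, scores):
    """Score-weighted vote over the final answers of m reasoning paths.

    Returns (winning_answer, confidence), where confidence is the winner's
    share of the total weight. Scores are assumed to be non-negative.
    """
    if not answers or len(answers) != len(scores):
        raise ValueError("need one score per path and at least one path")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")

    # Accumulate the weight behind each distinct answer.
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score

    total_weight = sum(totals.values())
    if total_weight <= 0:
        raise ValueError("at least one score must be positive")

    winner = max(totals, key=totals.get)
    return winner, totals[winner] / total_weight


def answer_or_abstain(answers, scores, threshold):
    """Return (answer, confidence); answer is None when the system abstains."""
    winner, confidence = weighted_vote(answers, scores)
    if confidence > threshold:
        return winner, confidence
    return None, confidence
```

With answers `["A", "B", "B"]` and scores `[0.9, 0.3, 0.3]`, the weighted vote selects `"A"` even though `"B"` holds the unweighted majority, which is the behavior that distinguishes this rule from plain self-consistency.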
This is the first conformal framework to target aggregation risk in multi-path LLM reasoning, rather than intra-trace or output-level uncertainty.
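The threshold calibration can be sketched with the standard conformal risk control recipe for a bounded, monotone loss. This is a sketch under stated assumptions, not the paper's exact procedure. It assumes the confident-error event is "answered and wrong", uses the decision rule "answer only if confidence strictly exceeds the threshold", and the name `calibrate_threshold` is hypothetical. The loss is bounded by 1, so the calibration-set error count is inflated by one before being compared with α.

```python
import math


def calibrate_threshold(confidences, correct, alpha):
    """Pick the smallest threshold whose CRC-adjusted risk is at most alpha.

    confidences: aggregated vote confidences on the calibration split
    correct:     booleans, True when the aggregated answer was right
    alpha:       target confident-error rate

    Assumed decision rule: answer iff confidence > threshold.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need matching, non-empty calibration lists")

    # Candidates: answer everything (-inf), then each observed confidence.
    candidates = [-math.inf] + sorted(set(confidences))
    for lam in candidates:
        confident_errors = sum(
            1 for v, ok in zip(confidences, correct) if v > lam and not ok
        )
        # Standard CRC adjustment for a loss bounded by 1.
        if (confident_errors + 1) / (n + 1) <= alpha:
            return lam

    # alpha is below 1 / (n + 1): no threshold qualifies, so always abstain.
    return math.inf
```

Because raising the threshold can only remove answered examples, the confident-error count is non-increasing in the threshold, so the first candidate that satisfies the bound is the least conservative valid choice on this grid.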
Theoretical Guarantees and Conditions
Theoretical analysis provides several key insights:
- Score Separability: The discriminative power of the vote confidence statistic v (i.e., whether it systematically takes higher values on correct than incorrect predictions) is shown to be both necessary and sufficient for abstention to improve accuracy among non-abstained items. The separability gap Δ(λ) quantifies this; no meaningful selective accuracy gain is possible without it.
- Closed-Form Accuracy Prediction: The selective accuracy achievable at any threshold can be computed in closed-form directly from the calibration data, using only empirical distributions over calibration examples. This enables precise planning of the accuracy-yield tradeoff prior to deployment without further test evaluation.
- Monotonic Accuracy-Yield Frontier: Under strict separability, the tradeoff between yield (fraction answered) and accuracy is monotonic: tightening the abstention threshold improves accuracy on retained examples at the cost of lower coverage.
- Finite-Sample CRC Validity: For exchangeable calibration and test data, the CRC-calibrated threshold controls confident error at or below α in finite samples, independent of model, aggregation, or score function used.
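The accuracy-yield tradeoff described above can be previewed from calibration data alone. The sketch below uses the plain empirical plug-in estimate (the fraction correct among calibration examples whose confidence exceeds the threshold); the paper's own closed-form expression is not reproduced here. The function name `accuracy_yield_curve` and the strict-inequality decision rule are assumptions made for illustration.

```python
def accuracy_yield_curve(confidences, correct, thresholds):
    """Empirical yield and selective accuracy at each candidate threshold.

    Returns a list of (threshold, yield, selective_accuracy) tuples.
    selective_accuracy is None when every example is abstained on.
    Assumed decision rule: answer iff confidence > threshold.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need matching, non-empty calibration lists")

    curve = []
    for lam in thresholds:
        # Correctness flags of the examples that would be answered.
        kept = [bool(ok) for v, ok in zip(confidences, correct) if v > lam]
        coverage = len(kept) / n
        accuracy = sum(kept) / len(kept) if kept else None
        curve.append((lam, coverage, accuracy))
    return curve
```

When the confidence separates correct from incorrect predictions well, the accuracy column rises as the yield column falls; a flat accuracy column suggests the chosen score has little discriminative power.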
Empirical Results
Evaluations were conducted on GSM8K, MATH, MATH-Hard, and HotpotQA, leveraging four open-source LLMs and three families of quality scores. Headline empirical findings include:
- The realized confident-error rates closely match or slightly undercut the target α across all tested scenarios, demonstrating effective risk control.
- The closed-form selective accuracy predictor and observed accuracy track each other, confirming theoretical predictions.
- On GSM8K, the method achieves 90.1% selective accuracy while abstaining on fewer than 5% of examples, compared with 82% accuracy for the majority-voting baseline.
- Score choice is the central driver of utility: reward- and judge-based scores consistently yield higher separability, stronger accuracy-yield frontiers, and greater accuracy gains at any coverage level. Perplexity and simple self-consistency scores are less discriminative, limiting the benefit of abstention.
Ablation studies show robustness to the number of sampled paths (yield increases monotonically while accuracy saturates), the calibration split size (variance declines with larger calibration sets), and the score weight function (exponential weighting performs best, with diminishing returns for aggressive scaling).
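One plausible reading of the exponential weight function is a mapping from raw path scores to vote weights of the form exp(β · score). The sketch below is an assumption for illustration only: the scale parameter `beta`, the function name `exponential_weights`, and the shift by the maximum score (which only rescales all weights and leaves normalized vote shares unchanged) are not taken from the paper.

```python
import math


def exponential_weights(scores, beta=1.0):
    """Map raw path scores to vote weights proportional to exp(beta * score).

    beta is a hypothetical scale parameter: beta = 0 recovers an unweighted
    vote, and larger values concentrate weight on the top-scoring paths.
    """
    if not scores:
        raise ValueError("need at least one score")
    top = max(scores)
    # Subtracting the maximum avoids overflow for large beta * score.
    return [math.exp(beta * (s - top)) for s in scores]
```

Large values of `beta` push nearly all weight onto the single best-scored path, which is one way to picture why aggressive scaling would bring diminishing returns.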
Limitations
The guarantee on confident-error rate is marginal and does not provide per-instance calibration certificates. Guarantees apply only under the assumption of calibration and test data exchangeability; they do not persist under arbitrary distribution shift. The method is focused on tasks with extractable, closed-form answers, and extension to unconstrained generative settings (e.g., free-form text generation) would require reformulation of correctness and error events.
Implications and Future Directions
In practice, the framework provides a reliability layer for CoT-based LLM deployment, letting users tune the tradeoff between coverage and safety in downstream scenarios where confidently incorrect answers are substantially costlier than abstentions (e.g., medical or legal QA, math assistance, autonomous systems). The method is fully model-agnostic, non-intrusive to LLM weights, and compatible with any pool-based reasoning protocol.
Theoretically, the explicit connection between abstention gain and score separability provides a quantifiable, diagnostic criterion for evaluating and improving aggregation functions and path-quality scores, a key step toward robust, reliable multi-step LLM reasoning.
Future developments could pursue extension to open-ended domains, domain-adaptive or Bayesian calibration strategies, tighter per-instance guarantees, or integration with active human-in-the-loop review where abstentions are routed for external adjudication.
Conclusion
This work introduces an inference-time, conformal aggregation framework for chain-of-thought LLM reasoning that provides probabilistic guarantees on the confident-error rate and quantifies the selective accuracy gain achievable by abstention. Reliability is fundamentally governed by the quality of aggregation and the discriminative power of path-quality scores, not by the characteristics of individual reasoning traces. Empirical and theoretical results demonstrate that robust error control is achievable with practical, score-based aggregations, laying the groundwork for reliable selective answering in high-stakes LLM applications.
Reference:
"Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning" (2605.14098)