Geometry-Calibrated Conformal Abstention for Language Models

Published 30 Apr 2026 in cs.CL and cs.LG | (2604.27914v1)

Abstract: When LLMs lack relevant knowledge for a given query, they frequently generate plausible responses that can be hallucinations, rather than admitting being agnostic about the answer. Retraining models to reward admitting ignorance can lead to overly conservative behaviors and poor generalization due to scarce evaluation benchmarks. We propose a post hoc framework, Conformal Abstention (CA), adapted from conformal prediction (CP) to determine whether to abstain from answering a query. CA provides finite-sample guarantees on both the probability of participation (i.e., not abstaining) and the probability that the generated response is correct. Importantly, the abstention decision relies on prediction confidence rather than the non-conformity scores used in CP, which are intractable for open-ended generation. To better align prediction confidence with the model's ignorance, we introduce a calibration strategy using representation geometry within the model to measure knowledge involvement in shaping the response. Experiments demonstrate that we improve selective answering significantly with 75 percent conditional correctness.


Authors (4)

Summary

The paper introduces a post-hoc conformal abstention framework that leverages geometric signals from token representations to decide when to answer or abstain.
It utilizes a calibration scheme based on proximity, embedding rotation, and anisotropy alignment to generate robust scalar uncertainty estimates.
Experiments report 75% average conditional correctness under conformal abstention, along with higher AUROC and AUPRC than the abstention baselines compared.

Geometry-Calibrated Conformal Abstention for LLMs

Motivation and Problem Statement

LLMs are frequently deployed in settings where reliability is paramount, yet they tend to respond to queries even when lacking sufficient knowledge, producing linguistically plausible but factually ungrounded outputs (hallucinations). Standard training objectives and evaluation protocols incentivize guessing rather than abstention, resulting in models that rarely admit ignorance. Prior efforts to explicitly reward abstention via retraining face challenges of benchmark scarcity and poor generalization, often leading to indiscriminate refusals.

This work proposes a post-hoc abstention mechanism—Conformal Abstention (CA)—grounded in conformal prediction (CP). CA does not require retraining and is agnostic to base model architectures, offering finite-sample guarantees both for response participation (the probability of generating an answer) and conditional correctness (the probability that retained responses are correct). CA departs from traditional non-conformity scoring, which is infeasible for open generative tasks, and instead leverages calibrated prediction confidence, enhanced with geometric signals derived from the internal representation dynamics of LLMs.

Conformal Abstention Framework

CA employs uncertainty scores to decide whether to answer or abstain for each query. These scores are calibrated on a reference set to yield two guarantees:

Participation Guarantee: Bounds the probability that the model chooses to answer rather than abstain.
Conditional Correctness Guarantee: Bounds the probability that an answered query is actually correct.

Both guarantees hold for finite calibration sets, under the assumption that calibration and test queries are exchangeable. Unlike prior CP approaches relying on likelihood or sampling-based coverage, CA bases its abstention decision on a scalar uncertainty estimate, making it applicable to free-form generative tasks.
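As an illustration of how a participation guarantee can follow from exchangeability, here is a minimal sketch of a standard split-conformal quantile rule. It is not taken from the paper: the function names, the target rate `beta`, and the choice to threshold a confidence score are all assumptions made for this example, and the conditional correctness value it reports is only an empirical check on the calibration set, not a finite-sample bound.

```python
import math
import numpy as np

def participation_threshold(cal_conf, beta):
    """Pick tau so that P(test_conf >= tau) >= beta under exchangeability.

    Hypothetical helper: with n calibration scores sorted ascending, the
    k-th smallest score with k = floor((n + 1) * (1 - beta)) serves as tau.
    """
    scores = np.sort(np.asarray(cal_conf, dtype=float))
    n = scores.size
    k = math.floor((n + 1) * (1.0 - beta))
    if k <= 0:
        return -math.inf  # target so high that the model must always answer
    return float(scores[k - 1])

def decide(conf, tau):
    """Answer when confidence reaches the threshold, otherwise abstain."""
    return "answer" if conf >= tau else "abstain"

def empirical_conditional_correctness(cal_conf, cal_correct, tau):
    """Fraction of answered calibration queries that were correct (diagnostic only)."""
    conf = np.asarray(cal_conf, dtype=float)
    correct = np.asarray(cal_correct, dtype=bool)
    answered = conf >= tau
    if not answered.any():
        return float("nan")
    return float(correct[answered].mean())
```

Raising `beta` lowers the threshold and increases participation, usually at the cost of conditional correctness; this is the coverage versus correctness trade-off that the two guarantees are meant to control.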

Geometry-based Confidence Calibration

To improve the alignment between uncertainty scores and true correctness, the authors introduce a calibration scheme leveraging the geometry of token representations within the Transformer architecture. Since MLP layers are found to encode factual knowledge, the framework tracks how knowledge involvement shapes the internal trajectory of token encodings:

Proximity: Measures the direct contribution of MLP updates to the token embedding.
Embedding Rotation: Quantifies the angular change induced by the MLP update, linking geometric transformations to changes in model prediction.
Anisotropy Alignment: Assesses alignment of token embeddings with dominant in-distribution directions, representing canonical representation cones of the model.
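The summary does not give the exact formulas for these three signals, so the following is only a rough sketch of one plausible reading. It assumes a residual update `h_out = h_in + mlp_update`, uses cosine similarity for proximity and alignment, uses the angle between `h_in` and `h_out` for rotation, and takes `dominant_dir` as a hypothetical precomputed in-distribution direction (for example a top principal component). Averaging over layers is likewise an assumption.

```python
import numpy as np

def _cos(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def geometric_features(h_in, mlp_update, dominant_dir):
    """Three illustrative per-layer signals for one token (assumed definitions)."""
    h_out = h_in + mlp_update  # token state after the MLP writes to the residual stream
    proximity = _cos(mlp_update, h_out)  # how much of the new state points along the update
    rotation = float(np.arccos(np.clip(_cos(h_in, h_out), -1.0, 1.0)))  # angle in radians
    anisotropy = _cos(h_out, dominant_dir)  # alignment with the dominant direction
    return np.array([proximity, rotation, anisotropy])

def aggregate_over_layers(per_layer_features):
    """Collapse a (layers, 3) array into one feature vector by averaging (assumption)."""
    return np.mean(np.asarray(per_layer_features, dtype=float), axis=0)
```

In practice `h_in` and `mlp_update` would be read from the model's hidden states, for instance with forward hooks; that extraction step is omitted here.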

These geometric features are aggregated per token across layers, yielding signals that help distinguish correct from incorrect predictions. They are then calibrated with a Mahalanobis distance whose statistics are estimated from the features of correct and incorrect predictions, which produces the final confidence estimate.
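One common way to turn Mahalanobis distances into a confidence score is sketched below. It is an assumption-laden illustration, not the authors' calibration: it fits one mean per class (correct, incorrect), a shared covariance with a small ridge term, and scores a feature vector by how much closer it lies to the correct class. The class name, the `ridge` parameter, and the score definition are all hypothetical.

```python
import numpy as np

class MahalanobisConfidence:
    """Illustrative two-class Mahalanobis scorer over geometric features."""

    def fit(self, feats, correct, ridge=1e-6):
        X = np.asarray(feats, dtype=float)
        y = np.asarray(correct, dtype=bool)
        self.mu_pos = X[y].mean(axis=0)    # mean feature vector of correct predictions
        self.mu_neg = X[~y].mean(axis=0)   # mean feature vector of incorrect predictions
        centered = np.vstack([X[y] - self.mu_pos, X[~y] - self.mu_neg])
        # Shared covariance; the ridge keeps the matrix invertible.
        cov = centered.T @ centered / len(X) + ridge * np.eye(X.shape[1])
        self.prec = np.linalg.inv(cov)
        return self

    def _dist2(self, X, mu):
        d = X - mu
        return np.einsum("ij,jk,ik->i", d, self.prec, d)  # squared Mahalanobis distance

    def score(self, feats):
        """Higher means closer to the 'correct' class than to the 'incorrect' one."""
        X = np.atleast_2d(np.asarray(feats, dtype=float))
        return self._dist2(X, self.mu_neg) - self._dist2(X, self.mu_pos)
```

The resulting scalar can be thresholded to decide between answering and abstaining.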

Experimental Results

The authors validate CA across six datasets covering open and closed-form QA (e.g., Natural Questions, TruthfulQA, Simple Questions Wiki, SciQ, GSM8K, CommonsenseQA), employing Gemma-3-4B-Instruct, LLaMA-3.2-3B-Instruct, and LLaMA-3-8B-Instruct as base models. Correctness evaluation leverages semantic similarity and multi-stage adjudication using independent LLMs for rigor.

Key findings:

Conditional Correctness: CA achieves the highest average conditional correctness (75%) under conformal abstention, outperforming likelihood-, consistency-, attention-, embedding-, and self-verbalized confidence baselines.
Discriminative Power: AUROC and AUPRC metrics demonstrate superior separability for CA relative to all baselines across datasets, with gains persisting in calibrated and uncalibrated settings.
Component Contribution: Ablations show that each geometric signal has a comparable impact, suggesting the signals are complementary rather than redundant.
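For readers unfamiliar with the discrimination metrics above, the sketch below shows how AUROC and a standard AUPRC estimate (average precision) can be computed from confidence scores and correctness labels. The function names are hypothetical and the code does not reproduce the paper's evaluation pipeline; it only illustrates what the metrics measure.

```python
import numpy as np

def auroc(conf, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one correct and one incorrect example")
    diff = pos[:, None] - neg[None, :]  # all (correct, incorrect) score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()  # ties count as half
    return float(wins / (pos.size * neg.size))

def average_precision(conf, correct):
    """Mean precision at the rank of each correct answer, ranked by confidence."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    if not correct.any():
        raise ValueError("need at least one correct example")
    order = np.argsort(-conf, kind="stable")  # most confident first
    hits = correct[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())
```

An AUROC of 0.5 corresponds to confidence scores that carry no information about correctness, while 1.0 means every correct answer is ranked above every incorrect one.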

Performance evaluations with alternative base models and additional calibration studies further reinforce CA's robustness and generalizability.

Theoretical and Practical Implications

CA offers a formal approach to abstention in LLMs that provides finite-sample guarantees without retraining. Geometry-based calibration improves the correspondence between confidence and correctness by drawing on latent knowledge signals that go beyond surface-level probabilities. The result is a principled selective answering strategy that raises the reliability of LLM-generated outputs in critical application domains.

From a theoretical standpoint, the work advances understanding of internal representation dynamics as markers of knowledge involvement, motivating further exploration of geometric interpretability in model uncertainty quantification.

Practically, CA enables safer deployment of LLMs by filtering unreliable or hallucinated responses, offering structured trade-offs between coverage and correctness. Its scalability analysis shows quadratic dependence on sequence length, but linear scaling in hidden dimension and layer count, rendering it tractable for moderate sequence lengths and model sizes.

Future Directions

Several avenues merit further investigation:

Extension to Multimodal and Agentic Settings: How geometry-calibrated abstention generalizes to vision-language or agentic LLMs.
Adversarial Robustness: Studying resilience of geometric abstention to adversarial queries and distributional shifts.
Integration with Interactive Systems: Leveraging abstention in multi-turn dialogues and human-in-the-loop workflows, especially under privacy and safety constraints.
Alternative Geometric Features: Exploring other representation-based signals—e.g., topological changes, clustering motifs—as abstention criteria.

Conclusion

Conformal Abstention with geometry-calibrated confidence addresses fundamental reliability shortcomings of LLMs in knowledge-limited scenarios, pairing finite-sample correctness guarantees with strong empirical results. Its use of internal representation dynamics for calibration is a promising direction for selective answering, enhancing both safety and trust in model outputs. Because the framework is composable and applied post hoc, it is a practical tool for high-stakes AI deployments and a starting point for future research into abstention-aware language modeling and geometric interpretability.

