
AbstentionBench: LLM Abstention Evaluation

Updated 30 June 2025
  • AbstentionBench is a large-scale benchmark designed to evaluate large language models' capacity to abstain when faced with unanswerable, underspecified, or ambiguous queries.
  • It integrates 20 diverse datasets containing over 35,000 queries, targeting scenarios like unknown answers, false premises, and outdated information to test model epistemic caution.
  • Its evaluation pipeline employs LLM-based judges and metrics such as abstention recall, revealing critical gaps in uncertainty reasoning and safe deployment practices.

AbstentionBench is a large-scale benchmark designed to systematically evaluate LLMs on their ability to abstain—i.e., to refuse to answer—when faced with unanswerable, underspecified, ambiguous, or otherwise ill-posed user queries. The benchmark addresses a critical, previously understudied aspect of LLM reliability: for truly trustworthy deployment, especially in everyday and high-stakes applications, LLMs must be able to recognize limitations in their knowledge or uncertainty in a question and respond appropriately with abstention, rather than producing potentially misleading, confabulatory, or confidently incorrect responses. AbstentionBench brings methodological rigor, dataset diversity, and targeted metrics to this challenge, revealing substantial gaps in current LLMs’ ability to reason about uncertainty and underspecification.

1. Motivation and Scope

AbstentionBench targets a foundational problem in trustworthy LLM deployment: real-world user queries frequently contain elements (unknown answers, lack of context, false presuppositions, subjective components, or stale information) that are not answerable in principle or given the model’s knowledge. In such scenarios, robust LLMs should exercise epistemic caution and abstain rather than attempt to answer. The benchmark was developed in response to the lack of a comprehensive, systematic evaluation framework for abstention in LLMs, incorporating a wide range of real-world query types and ambiguity structures. This is especially significant for applications in domains where errors carry serious consequences and “refusal” is preferable to misinformation.

2. Benchmark Design and Dataset Coverage

AbstentionBench consists of 20 datasets spanning more than 35,000 queries, crafted or selected to probe scenarios where abstention is warranted. Dataset types include:

  • Unknown answers: Questions about unsolved scientific problems, future events, or inherently uncertain facts.
  • Underspecified queries: Modified versions of existing datasets in which requisite context is deliberately removed to induce underspecification, including GSM8K (math word problems), MMLU (multidomain reasoning), and GPQA (graduate-level science questions).
  • False premises: Queries that rest on incorrect, impossible, or logically inconsistent statements.
  • Subjective and ambiguous questions: Prompts soliciting opinions, values, or inherently ambiguous information.
  • Outdated information: Queries about facts or events postdating the model's knowledge cutoff.

The benchmark draws from sources including web search logs, medical and scientific Q&A, ethical dilemmas, and real-world math word-problem corpora (e.g., UMWP, Unanswerable Math Word Problems).
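
To make the data model concrete, the following is a minimal sketch of how a single benchmark item could be represented; the class, field names, and scenario labels are illustrative assumptions rather than AbstentionBench's actual schema.

```python
# Hypothetical representation of one benchmark item; names and labels are
# illustrative assumptions, not AbstentionBench's actual data format.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ScenarioType(Enum):
    UNKNOWN_ANSWER = "unknown_answer"
    UNDERSPECIFIED = "underspecified"
    FALSE_PREMISE = "false_premise"
    SUBJECTIVE = "subjective"
    OUTDATED = "outdated"


@dataclass
class AbstentionItem:
    question: str                            # prompt shown to the model
    scenario: ScenarioType                   # which unanswerability category it probes
    should_abstain: bool                     # ground truth: is abstention warranted?
    reference_answer: Optional[str] = None   # present only for answerable variants


example = AbstentionItem(
    question="What is the exact population of Mars colonies in 2040?",
    scenario=ScenarioType.UNKNOWN_ANSWER,
    should_abstain=True,
)
```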

3. Evaluation Methodology

AbstentionBench employs an automated, LLM-based evaluation pipeline for both abstention and correctness assessment.

  • Abstention annotation: Each LLM output is classified as an abstention or a non-abstention using a carefully designed judge prompt with Llama 3.1 8B Instruct (sketched schematically after this list). The prompt covers various abstention styles (“I don’t know,” requests for clarification, caveats) and scenario types, and was validated against human annotation, reaching roughly 88% agreement.
  • Correctness evaluation: For non-abstained answers, a separate LLM judge determines binary correctness.
  • Metrics: The principal metric for abstention capability is recall, defined as the proportion of instances where abstention was warranted and the model correctly abstained. Precision and F1 are also reported, but precision is generally saturated due to the imbalance of unanswerable vs. answerable queries on most datasets. Accuracy scores are used for answer correctness.
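
The judge step can be thought of as a classify-and-parse routine. The sketch below builds a verdict prompt for the judge model and maps its reply to a boolean; the prompt wording and one-word parsing rule are illustrative assumptions, not the exact judge prompt used by the benchmark.

```python
# Schematic sketch of the abstention judge step. The prompt wording and the
# one-word parsing rule are illustrative assumptions, not the benchmark's
# actual judge prompt.
JUDGE_TEMPLATE = (
    "You will be shown a question and a model response.\n"
    "Decide whether the response ABSTAINS (says it does not know, asks for "
    "clarification, or declines due to missing or uncertain information) "
    "or ANSWERS.\n"
    "Reply with exactly one word: ABSTAIN or ANSWER.\n\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Verdict:"
)


def is_abstention(question: str, response: str, judge_generate) -> bool:
    """judge_generate is any callable that sends a prompt to the judge model
    (e.g., Llama 3.1 8B Instruct) and returns its text completion."""
    verdict = judge_generate(JUDGE_TEMPLATE.format(question=question, response=response))
    return verdict.strip().upper().startswith("ABSTAIN")
```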

This standardized, scalable methodology enables dense sampling across models and facilitates reliable comparative analysis.
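
As a minimal sketch of the metrics described above, the function below computes abstention recall, precision, and F1 from per-item judge decisions; the names and structure are illustrative, not taken from the AbstentionBench codebase.

```python
# Illustrative computation of abstention recall, precision, and F1 from judge
# labels; names are assumptions, not the benchmark's actual API.
def abstention_metrics(should_abstain: list[bool], did_abstain: list[bool]) -> dict[str, float]:
    tp = sum(g and p for g, p in zip(should_abstain, did_abstain))        # correctly abstained
    fp = sum((not g) and p for g, p in zip(should_abstain, did_abstain))  # abstained when an answer was expected
    fn = sum(g and (not p) for g, p in zip(should_abstain, did_abstain))  # answered when abstention was warranted
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Example: abstention was warranted on three of four items; the model abstained on two.
print(abstention_metrics(
    should_abstain=[True, True, True, False],
    did_abstain=[True, False, True, False],
))  # recall ≈ 0.67, precision = 1.0, f1 = 0.8
```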

4. Key Experimental Findings

Scaling and Reasoning Fine-tuning

  • Minimal scaling benefit: Increasing model size within families (e.g., Llama models across 8B, 70B, 405B parameters) has negligible or inconsistent impact on abstention recall. Models such as GPT-4o and Gemini-1.5 Pro showed only small improvements and no consistent scaling-abstention correlation.
  • Negative impact of reasoning fine-tuning: Models trained for explicit reasoning (e.g., DeepSeek R1, s1.1) perform worse at abstention, showing on average a 24% decrease in abstention recall compared to their instruction-tuned counterparts, even in domains (math, science) that reasoning tuning was intended to improve. These models frequently hallucinate missing context and produce confident, unwarranted answers rather than abstaining.
  • Reasoning token budgeting: Allowing models longer reasoning chains (higher token budgets) improves accuracy but worsens abstention recall, indicating a tradeoff whereby increased thoroughness in answering degrades uncertainty awareness.

Prompting and Post-hoc Interventions

  • System prompts can improve abstention: Carefully engineered prompts that explicitly describe abstention scenarios increased recall across many model/dataset pairs. However, this strategy does not correct fundamental weaknesses—the underlying reasoning patterns still fail to propagate uncertainty into final responses, especially for highly ambiguous or underspecified inputs.
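
As one illustration of this kind of intervention, the snippet below assembles a chat request with an abstention-oriented system prompt; the prompt wording and helper are assumptions for illustration, not the prompts evaluated in the paper.

```python
# Illustrative abstention-oriented system prompt; the wording is an assumption,
# not the exact prompt studied in AbstentionBench.
ABSTAIN_SYSTEM_PROMPT = (
    "If a question is unanswerable, underspecified, based on a false premise, "
    "purely subjective, or beyond your knowledge cutoff, do not guess. "
    "Say that you cannot answer and briefly explain why, or ask a clarifying "
    "question instead."
)


def build_messages(user_query: str) -> list[dict]:
    """Assemble a standard chat-format request that prepends the abstention prompt."""
    return [
        {"role": "system", "content": ABSTAIN_SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]


messages = build_messages("Who will win the 2030 World Cup?")
# `messages` can then be passed to any chat-completion style API.
```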

Dataset-level and Task-level Variation

  • Domain challenges remain unresolved: On tasks requiring reasoning about underspecification, ambiguity, or unseen knowledge (e.g., UMWP, MMLU-Math-Abstain), even top models routinely answer when abstention is correct. Performance varies widely, and neither size nor tuning reliably predicts abstention behavior.
  • No accuracy-recall alignment: There is little correlation between models’ answer accuracy on non-abstained responses and abstention recall, underscoring the orthogonality of these capabilities.

5. Implications and Research Directions

Revealed reliability gap: AbstentionBench demonstrates that current LLMs, regardless of scale or reasoning-oriented training, are not reliably able to recognize unanswerable queries or admit epistemic uncertainty.

Training and reward modeling: Optimizing LLMs for correctness on verifiable datasets (math, code, etc.) incentivizes confident answering, potentially at the expense of proper uncertainty recognition. This suggests future approaches should incorporate explicit abstention scenarios, ambiguous queries, and uncertainty-aware objectives during training and reward modeling.

Prompting vs. architectural change: While prompt engineering provides a partial, short-term solution, the fundamental challenge appears to require modifications at the level of model objectives and possibly architecture: ensuring that uncertainty ("epistemic humility") is properly reasoned about and surfaced in final model outputs, not merely in intermediate reasoning tokens.

Benchmarking and assessment: AbstentionBench sets a new standard for evaluating LLMs’ handling of uncertainty, ambiguity, and unanswerability. Its extensible datasets and LLM-as-judge pipeline enable ongoing, large-scale comparison and iterative improvement of both open and closed models.

Safety and risk mitigation: The findings emphasize the necessity of reliable abstention for real-world safety, especially in domains (medicine, law, science, public information) where confidently wrong answers are especially harmful.

6. Future Directions and Benchmark Contribution

AbstentionBench motivates research along the following lines:

  • Diversifying abstention training data: Incorporate more representative abstention scenarios, ambiguous and adversarial queries, and private/unseen data to prevent overfitting.
  • Uncertainty reasoning interventions: Develop mechanisms for models to propagate and express uncertainty in final outputs, possibly by blending reasoning and selective prediction paradigms.
  • Reward model recalibration: Incorporate penalties for unwarranted confidence and incentives for explicit uncertainty when appropriate in reinforcement learning from human feedback.
  • Cross-domain abstention-aware evaluation: Systematically prioritize abstention evaluation in settings with unverifiable ground truth or where the cost of error cannot be precisely quantified.

Ongoing updates and extensibility: AbstentionBench is intended as a community benchmark for continuous assessment. Its coverage is expected to expand in response to new deployment settings and as LLM architectures evolve.

7. Summary Table: Principal Features and Findings of AbstentionBench

| Feature | Description | Key Findings |
| --- | --- | --- |
| Unanswerable scenario types | Unknowns, underspecification, false premises, subjectivity, stale information | All present significant challenges |
| Models evaluated | 20 LLMs (open/closed, various scales, reasoning fine-tuned) | Reasoning fine-tuning worsens abstention |
| Primary metric | Abstention recall | No consistent scaling/accuracy correlation |
| Judge mechanism | LLM judge with engineered prompts (validated) | ~88% alignment with human annotation |
| Prompts/interventions | Carefully crafted prompts boost recall but are insufficient | A full solution requires deeper interventions |
| Implications | Current LLMs are unreliable in refusing unanswerable queries | Accurate reasoning ≠ good abstention |

Conclusion

AbstentionBench fills a critical gap in the evaluation of LLMs, providing a rigorous, large-scale, and extensible framework for measuring and improving the ability of LLMs to abstain appropriately in the face of uncertainty, ambiguity, and underspecification. Results underscore that progress in LLM reliability will require explicit focus on uncertainty reasoning, new training paradigms, and continual benchmark-led assessment.