Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

Published 18 Apr 2026 in cs.CL and cs.AI | (2604.17073v1)

Abstract: Reinforcement fine-tuning improves the reasoning ability of LLMs, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.


Authors (3)

Summary

The paper introduces an RL approach that uses verifiable rewards to calibrate abstention and to produce actionable clarifications for unanswerable queries.
It employs a two-phase training pipeline, supervised fine-tuning followed by reinforcement learning with groupwise verifiable rewards, to secure both correct answers on answerable queries and calibrated refusal on unanswerable ones.
Empirical results show that the 3B Abstain-R1 substantially improves over its base model and is competitive with much larger systems, including DeepSeek-R1, on unanswerable queries while preserving accuracy on answerable ones.

Clarification-Aware Abstention in LLMs via RL with Verifiable Rewards: An Analysis of Abstain-R1

Background and Motivation

Recent advances in reinforcement learning (RL) for LLMs have led to gains in complex reasoning and generalization. However, this progress often comes at the expense of reliability, most notably an increased tendency to hallucinate when faced with unanswerable queries. This side effect, termed the "Hallucination Tax", arises when models confidently produce answers by inventing or assuming unstated information. Previous efforts to mitigate the risk either promote generic refusals (e.g., responding "I don't know") via supervised fine-tuning (SFT) or encourage post-refusal clarifications without verifying that those clarifications identify the missing information. Neither paradigm provides the supervision needed for calibrated abstention and actionable explanations in high-stakes domains. "Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL" (2604.17073) addresses these shortcomings with an RL-based fine-tuning scheme that jointly optimizes explicit abstention and semantically aligned clarification, supported by automatic, verifiable rewards.

Problem Definition: Unanswerable Queries and Post-Refusal Clarification

The unanswerable cases considered in this work are not ambiguous in language or intent; they are underconstrained because the required information is missing, contradictory, or inherently unknowable. A reliable system should refrain from guessing and, after abstaining, identify what information is needed to resolve the query, so that the refusal is actionable rather than a dead end. Existing SFT-based approaches often yield brittle, templated abstentions, while RL-based approaches reward only the act of abstaining, not the content of the clarification. As a result, most models fall short of the behavior expected of an expert in safety-critical settings.

Figure 1: Differential behaviors on an unanswerable query: hallucinated answer, irrelevant clarification, and precise identification of missing information.

Methodology: RLVR with Clarification-Aware Rewards

Dataset Construction

Abstain-R1 relies on two curated datasets, one for training and one for evaluation:

Abstain-CoT (SFT stage): Incorporates answerable and unanswerable queries across math, science, reading comprehension, medicine, ethics, and more. For unanswerable items, the supervision explicitly prescribes abstaining, followed by actionable clarifications.
Abstain-Test (Evaluation): Designed to measure both abstention and quality of clarification on held-out data. Includes diverse domains with fine-grained annotations for answerability and reference clarifications.
Figure 2: Distribution of domains for SFT and evaluation, ensuring broad generalization.

Model Initialization and Training Pipeline

Training is divided into two key phases:

Supervised fine-tuning (SFT) on Abstain-CoT, enforcing the output schema and teaching initial clarification strategies.
Reinforcement Learning with Verifiable Rewards (RLVR): Leveraging Group Relative Policy Optimization (GRPO), Abstain-R1 is trained such that (i) answerable queries are scored by strict answer correctness, and (ii) unanswerable queries receive a composite reward: explicit abstention earns partial credit, with further reward granted only if the clarification precisely matches reference clarifications as judged by a verification model.
Figure 3: Overview of the RLVR pipeline: data construction, SFT initialization, and RL with groupwise verifiable rewards.
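The "groupwise" part of the RL stage can be illustrated with the group-relative advantage that GRPO is generally described as using: several responses are sampled for the same query, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The following is a minimal sketch under that assumption, not the authors' training code; the function name, the `eps` stabilizer, and the sample reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several responses to the SAME query.
    eps:     hypothetical stabilizer so an all-equal group does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative group of four responses to one unanswerable query:
# two verified clarifications, one bare refusal, one hallucinated answer.
advantages = group_relative_advantages([1.0, 0.3, 0.0, 1.0])
```

Because the normalization is per group, a bare refusal is pushed down whenever other samples for the same query also supply a verified clarification, which is one way a reward of this shape can favor clarification over generic refusal.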

Reward Function Design

Format reward enforces a strict response structure.
Answerable reward penalizes false refusals (-1) and rewards correct answers (+1).
Abstention reward for unanswerable queries grants 0.3 for explicit refusal, and the remaining 0.7 only if the clarification is verified as semantically correct.
Clarification verification is handled by a lightweight LLM verifier optimized for precision.
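The reward components listed above can be combined into a single scoring function. The sketch below is one plausible reading, not the paper's implementation: the +1, -1, 0.3, and 0.7 values come from the text, while the `Rollout` fields, the zero reward for a malformed response, the zero reward for a wrong answer, and the zero reward for answering an unanswerable query are all assumptions made for illustration.

```python
from dataclasses import dataclass

REFUSAL_CREDIT = 0.3        # from the text: credit for an explicit refusal
CLARIFICATION_CREDIT = 0.7  # from the text: remainder, only if verified


@dataclass
class Rollout:
    """Hypothetical container for the judgments made about one response."""
    well_formatted: bool    # response follows the required structure
    refused: bool           # response explicitly abstains
    answer_correct: bool    # final answer matches the reference
    clarification_ok: bool  # verifier accepted the clarification


def reward(rollout: Rollout, answerable: bool) -> float:
    # Assumption: a response that breaks the format earns nothing.
    if not rollout.well_formatted:
        return 0.0
    if answerable:
        if rollout.refused:
            return -1.0  # false refusal penalty, from the text
        # Assumption: an attempted but wrong answer scores 0.
        return 1.0 if rollout.answer_correct else 0.0
    # Unanswerable query from here on.
    if not rollout.refused:
        return 0.0  # assumption: guessing on an unanswerable query scores 0
    bonus = CLARIFICATION_CREDIT if rollout.clarification_ok else 0.0
    return REFUSAL_CREDIT + bonus
```

Under these assumptions, a refusal with a verified clarification reaches the same maximum as a correct answer on an answerable query, while a bare refusal is capped at 0.3.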

Empirical Results

Quantitative Benchmarks

Abstain-R1 is compared against a diverse suite of open-source and proprietary models on the Abstain-Test, Abstain-QA, and SelfAware benchmarks. Notably, Abstain-R1:

Reaches refusal and clarification rates on unanswerable queries that match or exceed those of much larger models (8B, 32B, or proprietary) despite having only 3B parameters.
Maintains or improves conditional answer accuracy on answerable questions, indicating that improvements in refusal calibration do not degrade answer coverage or correctness.
Figure 4: U-Clar and U-Ref vs. model size, showing Abstain-R1’s gains over both scaling and non-RL approaches.
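The metrics named above can be made concrete with a small aggregation routine. This is a sketch under assumed definitions, since the summary does not spell them out: U-Ref is taken as the share of unanswerable queries that are refused, U-Clar as the share that are refused with a verified clarification, and conditional answer accuracy as accuracy over answerable queries the model actually attempted. The record keys are hypothetical.

```python
def summarize(records):
    """Aggregate per-query judgments into abstention metrics.

    records: list of dicts with hypothetical boolean keys
             'answerable', 'refused', 'correct', 'clar_ok'.
    """
    unanswerable = [r for r in records if not r["answerable"]]
    answerable = [r for r in records if r["answerable"]]
    attempted = [r for r in answerable if not r["refused"]]

    def rate(hits, total):
        # An empty denominator yields NaN rather than raising.
        return hits / total if total else float("nan")

    return {
        # Assumed definition: refused unanswerable / all unanswerable.
        "u_ref": rate(sum(r["refused"] for r in unanswerable), len(unanswerable)),
        # Assumed definition: refused AND verified / all unanswerable.
        "u_clar": rate(
            sum(r["refused"] and r["clar_ok"] for r in unanswerable),
            len(unanswerable),
        ),
        # Assumed definition: correct / answerable queries that were attempted.
        "cond_acc": rate(sum(r["correct"] for r in attempted), len(attempted)),
        # Share of answerable queries that were wrongly refused.
        "false_refusal": rate(sum(r["refused"] for r in answerable), len(answerable)),
    }
```

Reporting the false-refusal rate next to conditional accuracy matters under these definitions, because a model could otherwise raise its conditional accuracy simply by refusing the answerable queries it finds hard.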

Training Dynamics

During RL, Abstain-R1’s refusal, clarification accuracy, and answer accuracy simultaneously increase, with a minor initial rise in mean response length followed by a compression toward concise, actionable outputs.

Figure 5: Mean response length vs. training steps, reflecting increased precision and brevity with more RL steps.

Figure 6: Abstention rate, clarification correctness, and answer accuracy over RL steps—indicating simultaneous improvements in all axes.

Qualitative Analysis

Domain-specific case studies illustrate how Abstain-R1 behaves in each setting:

Fact-Checking: Refuses to answer implausible or contradictory queries, rather than heuristically "repairing" or hallucinating answers.
Medical Reasoning: Identifies missing clinical features and declines to diagnose when evidence is insufficient—contrary to baseline models, which guess.
Mathematics: Flags missing variables in underdetermined problems rather than hallucinating numeric values.
Bias/Ethics: Avoids speculative, stereotype-based answers, and instead signals that information is insufficient for comparison.
Figure 7: Fact-checking—Abstain-R1 detects contradiction and appropriately refuses.

Figure 8: Medical domain—Abstain-R1 flags underspecification, refusing to guess diagnoses.
Figure 9: Mathematics—Abstain-R1 recognizes ill-posedness, abstaining with explicit clarification.

Figure 10: Bias/Ethics—Abstain-R1 refuses to make unsupported generalizations.

Ablations and Reward Sensitivity

Ablation studies establish:

Removal of SFT sharply degrades both refusal and clarification.
RL without clarification rewards results in generic or irrelevant clarifications despite frequent abstentions.
Models trained with weaker or no penalties for false refusals on answerable questions become overly conservative, highlighting the necessity of careful reward balancing.
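The last ablation can be made concrete with a back-of-the-envelope calculation. The sketch below considers a degenerate policy that always refuses and never supplies a verified clarification, with a fraction $p$ of unanswerable queries in the training mix and a false-refusal penalty $\lambda$. Both $p$ and $\lambda$ are illustrative symbols, not quantities reported in the source; the 0.3 refusal credit is the value stated under Reward Function Design.

```latex
\mathbb{E}\left[R_{\text{always-refuse}}\right] = 0.3\,p - \lambda\,(1 - p),
\qquad
\mathbb{E}\left[R_{\text{always-refuse}}\right] > 0
\iff
\lambda < \frac{0.3\,p}{1 - p}.
% Worked example with an assumed balanced mix, p = 0.5:
%   \lambda = 1 : 0.3(0.5) - 1(0.5) = 0.15 - 0.50 = -0.35  (blanket refusal is penalized)
%   \lambda = 0 : 0.3(0.5) - 0(0.5) = 0.15 - 0.00 = +0.15  (blanket refusal is rewarded)
```

Under these assumptions, removing the penalty makes blanket refusal a strictly positive-reward strategy, which is consistent with the over-conservative behavior the ablation reports.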

Figure 11: Refusal and clarification gains vs. reward weight, visualizing trade-offs between refusal, clarification, and false refusals on answerable queries.

Prompting and SFT vs. RLVR

In-context learning and SFT alone can trigger abstention behavior, but these methods are outperformed by RLVR in both abstention calibration and actionable clarification, with RLVR yielding a superior trade-off between false refusals and overall accuracy.

Theoretical and Practical Implications

The results for Abstain-R1 suggest that:

Reliable, calibrated abstention and actionable clarification do not emerge naturally from increased scale or default RL post-training procedures; rather, they require targeted, verifiable reward strategies.
RLVR with clarification-aware objectives enables smaller models to approach or surpass much larger systems with respect to safety-critical behaviors, reducing the dependence on brute-force scaling.
Explicitly targeting post-refusal clarification provides actionable uncertainty, facilitating human-in-the-loop workflows and trust calibration for use-cases such as clinical, legal, and fact-sensitive question answering.

From a theoretical perspective, Abstain-R1 treats abstention as a first-class behavior composed of two parts, the refusal itself and the clarification that follows it. It positions verifiable RL objectives as a practical route to robust handling of epistemic uncertainty in LLMs.

Future Directions

The Abstain-R1 approach suggests several directions for future work:

Extending clarification-aware abstention to multilingual, open-ended, and interactive agent contexts where uncertainty propagation and communication are critical (e.g., tool-use, interactive planning).
Integrating human-in-the-loop or weakly supervised clarifications to mitigate verifier bias and increase coverage of valid clarifications.
Exploring more granular reward shaping at the clarification level, including hierarchical and multi-fidelity verifiers, to further close the gap with human experts.

Advances in verifiable reward functions and automatic labeling will be instrumental in scaling the Abstain-R1 framework to broader environments.

Conclusion

Abstain-R1 couples explicit abstention with actionable, verified clarifications for queries that cannot be resolved from the given information. Its training pipeline and reward design improve substantially over the base model, remain competitive with much larger systems on unanswerable queries, and preserve accuracy on answerable ones, using rewards that are computed automatically. This work positions clarification-aware RLVR as a promising component for trustworthy, collaborative AI systems (2604.17073).
