Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

Published 19 Apr 2026 in cs.CL | (2604.17487v1)

Abstract: Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports. We study this failure mode as overcommitment control and introduce compositional selective specificity (CSS), a post-generation layer that decomposes an answer into claims, proposes coarser backoffs, and emits each claim at the most specific calibrated level that appears admissible. The method is designed to express uncertainty as a local semantic backoff rather than as a whole-answer refusal. Across a full LongFact run and HotpotQA pilots, calibrated CSS improves the risk-utility trade-off of fixed drafts. On the full LongFact run, it raises overcommitment-aware utility from 0.846 to 0.913 relative to the no-CSS output while achieving 0.938 specificity retention. These results suggest that claim-level specificity control is a useful uncertainty interface for agentic systems and a target for future distribution-free validity layers.


Authors (5)

Summary

The paper introduces calibrated claim-level specificity control to match each claim’s precision with underlying evidence.
It decomposes responses into atomic claims, using dual support scoring (fine and coarse) calibrated via Clopper-Pearson thresholds.
Empirical tests on the LongFact benchmark show higher overcommitment-aware utility (0.846 to 0.913) with 0.938 specificity retention and fewer unsupported claims.

Calibrated Claim-Level Specificity Control for Agentic Systems

Motivation and Problem Setting

Agentic LLM-based systems exhibit a failure mode characterized not by outright fabrication but by excessive precision: multi-claim outputs in which some claims are stated at a level of specificity that the underlying evidence does not support. Prior approaches, focused on answer-level abstention or selective prediction, do not address the compositional uncertainty inherent in multi-part responses, where the stated granularity is justified for some claims but not for others. The paper "Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems" (2604.17487) formalizes this as an overcommitment control problem: for every claim generated in response to a prompt, determine the highest specificity admissible given the retrieved evidence.

Method: Compositional Selective Specificity and Calibration

The approach decomposes outputs into atomic claims, each paired with a coarse (less-specific) backoff and the null (omission) action. For every claim, support scores are computed for both fine-grained and coarse representations against the evidence. The central mechanism—the calibrated selection policy—employs thresholds on these support scores, chosen through conservative calibration (Clopper-Pearson upper bounds) on a held-out split, to select the most specific but supportable claim form. This design allows the system to locally back off precision only where warranted, instead of globally abstaining from answering or omitting entire claims. The pipeline comprises:

Draft Generation: Base model produces candidate response.
Claim Extraction/Backoff: Decomposition of draft into atomic claims, with coarse rewrites.
Support Scoring: Hybrid LLM/verifier model assigns support scores to fine/coarse versions.
Claimwise Selection: Thresholded, calibrated policy decides which specificity level to emit.

This workflow produces fine-grained, compositional uncertainty control, targeting exactly those claims where evidence is insufficient.
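The conservative calibration step can be illustrated with a minimal sketch. The summary does not spell out the paper's exact procedure, so everything below is an assumption: the function names, the target risk of 0.1, the confidence level, and the rule of taking the lowest qualifying threshold are all hypothetical. The sketch computes a one-sided Clopper-Pearson upper bound on the unsupported rate of the claims a threshold would accept on a held-out split, and keeps a threshold only if that bound stays at or below the target.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P[X <= k] for X ~ Binomial(n, p). Fine for small n; a statistics
    library is the better choice for large calibration sets."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided (1 - alpha) upper bound on an error rate, given k errors
    in n trials. Solves binom_cdf(k, n, p) = alpha for p by bisection."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # binom_cdf is decreasing in p
        mid = (lo + hi) / 2.0
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def calibrate_threshold(scores, supported, target_risk=0.1, alpha=0.05):
    """Hypothetical rule: return the lowest score threshold whose accepted
    claims have a Clopper-Pearson upper bound on the unsupported rate at or
    below target_risk, or None if no threshold qualifies."""
    best = None
    for tau in sorted(set(scores), reverse=True):
        accepted = [ok for s, ok in zip(scores, supported) if s >= tau]
        errors = sum(1 for ok in accepted if not ok)
        if clopper_pearson_upper(errors, len(accepted), alpha) <= target_risk:
            best = tau
    return best


# Invented held-out data: 30 well-supported claims, 10 weaker ones.
scores = [0.9] * 30 + [0.5] * 5 + [0.2] * 5
supported = [True] * 30 + [True, True, False, False, False] + [False] * 5
tau = calibrate_threshold(scores, supported)
```

With this toy split only the 0.9 threshold qualifies, because the upper bound for 0 errors in 30 accepted claims is just under 0.1. Scanning many thresholds on the same split weakens the formal guarantee, so a stricter treatment would need a correction that this sketch omits.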

Figure 1: Example of claim-level specificity control, where unsupported claim detail is locally backed off to a valid, evidence-supported, coarser claim.
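The local backoff that Figure 1 illustrates amounts to a three-way choice per claim: emit the fine wording, emit the coarse wording, or omit. A minimal sketch follows, assuming one threshold per specificity level. The `Claim` fields, the function names, the threshold values, and the example claims are all invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    fine: str            # original, most specific wording
    coarse: str          # less specific backoff wording
    fine_score: float    # support score of the fine wording
    coarse_score: float  # support score of the coarse wording


def select(claim: Claim, tau_fine: float, tau_coarse: float) -> Optional[str]:
    """Return the most specific wording whose score clears its threshold."""
    if claim.fine_score >= tau_fine:
        return claim.fine
    if claim.coarse_score >= tau_coarse:
        return claim.coarse
    return None  # null action: omit the claim


def apply_css(claims: List[Claim], tau_fine: float, tau_coarse: float) -> List[str]:
    """Apply the per-claim choice and keep only the emitted wordings."""
    emitted = []
    for claim in claims:
        text = select(claim, tau_fine, tau_coarse)
        if text is not None:
            emitted.append(text)
    return emitted


# Invented example claims and scores.
claims = [
    Claim("The bridge opened in 1932.", "The bridge opened in the early 1930s.", 0.42, 0.91),
    Claim("The bridge is 1,149 m long.", "The bridge is over 1 km long.", 0.95, 0.97),
    Claim("It was designed by J. Doe.", "It was designed by a local engineer.", 0.10, 0.30),
]
emitted = apply_css(claims, tau_fine=0.8, tau_coarse=0.8)
```

Here the first claim is backed off to its coarse form, the second is kept as written, and the third is omitted because neither wording clears its threshold.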

Baselines and Ablations

The work contrasts the calibrated specificity selector with several alternatives:

No Control: Retains all original claims at maximal specificity (risk of overcommitment).
Whole-Answer Abstention: Accept-or-abstain at the response level, discarding all claims if any are unsupported.
Claim Dropping: Omit unsupported claims but never back off to coarse alternatives.
Uncalibrated Selector: Fixed, non-adaptive selection thresholds.
Oracle: Uses gold support labels for upper-bound selection.

This design clarifies the contribution of claim-level, calibrated selection as distinct from deletion or global uncertainty management.
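The difference between these policies is easiest to see when each is written as a function over the same list of claims. The sketch below is an assumption-laden illustration, not the paper's code: `fine_ok` and `coarse_ok` stand for whatever support decision a policy uses (thresholded scores for the calibrated and uncalibrated selectors, gold labels for the oracle), and the example claims are invented.

```python
from typing import List, NamedTuple


class Judged(NamedTuple):
    fine: str
    coarse: str
    fine_ok: bool    # is the fine wording judged supported?
    coarse_ok: bool  # is the coarse wording judged supported?


def no_control(claims: List[Judged]) -> List[str]:
    """Keep every claim at maximal specificity."""
    return [c.fine for c in claims]


def whole_abstain(claims: List[Judged]) -> List[str]:
    """Discard the entire response if any claim is unsupported."""
    return [c.fine for c in claims] if all(c.fine_ok for c in claims) else []


def claim_drop(claims: List[Judged]) -> List[str]:
    """Delete unsupported claims; never back off."""
    return [c.fine for c in claims if c.fine_ok]


def backoff_select(claims: List[Judged]) -> List[str]:
    """Emit the fine wording if supported, else the coarse one, else omit."""
    out = []
    for c in claims:
        if c.fine_ok:
            out.append(c.fine)
        elif c.coarse_ok:
            out.append(c.coarse)
    return out


# Invented example: one claim is only supported in its coarse form.
claims = [
    Judged("Founded in 1887.", "Founded in the late 1800s.", False, True),
    Judged("Based in Oslo.", "Based in Norway.", True, True),
]
```

On this toy input, whole-answer abstention emits nothing, claim dropping keeps one claim, and the backoff selector keeps both, one of them in coarse form.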

Empirical Results

Experiments are conducted on the LongFact benchmark (2,280 prompts, >11,000 claims) under a claimwise evaluation protocol, supplemented by pilots on LongFact and HotpotQA subsets (evaluated with GPT-5.4 and Claude Sonnet 4.6). The principal metric is overcommitment-aware utility, which rewards supported specificity, penalizes unsupported emissions, and accounts for retention at each specificity level.
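The summary describes overcommitment-aware utility only qualitatively, so the following is a toy stand-in and not the paper's formula. It assumes a per-claim value of 1.0 for a supported fine claim, 0.5 for a supported coarse claim, 0 for an omission, and a penalty of 1.0 for any unsupported emission; all four numbers and the function name are hypothetical.

```python
def toy_utility(emitted, fine_value=1.0, coarse_value=0.5, penalty=1.0):
    """Average per-claim utility under the assumed values above.

    emitted: list of (level, supported) pairs, one per draft claim, where
    level is "fine", "coarse", or "omit".
    """
    if not emitted:
        return 0.0
    total = 0.0
    for level, supported in emitted:
        if level == "omit":
            continue  # omissions earn nothing and cost nothing
        value = fine_value if level == "fine" else coarse_value
        total += value if supported else -penalty
    return total / len(emitted)


# Invented three-claim draft whose third claim is unsupported as written.
no_control = [("fine", True), ("fine", True), ("fine", False)]
claim_drop = [("fine", True), ("fine", True), ("omit", False)]
with_backoff = [("fine", True), ("fine", True), ("coarse", True)]
```

Under these assumed values the ordering is no control, then claim dropping, then backoff, which matches the qualitative description: unsupported emissions are penalized, and a supported coarse claim is worth more than an omission.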

The calibrated selector delivers:

Full LongFact: Overcommitment-aware utility improves from 0.8460 (no control) to 0.9130 (calibrated), with 0.938 retained specificity and 0.9865 support precision—maintaining nearly all supported content while sharply reducing unsupported claims.
Whole-abstain baseline: Achieves high precision (0.9825) but at the cost of large information loss (0.6292 specificity retention, utility 0.6072).
Claim-drop: Good precision and utility (0.9877 and 0.9019, respectively), but less effective than the calibrated selector at preserving claims whose coarse form is supported even though the fine form is not.
Uncalibrated: Extremely conservative (0.9934 precision) but discards more valid content (utility 0.8521).

The pilots show the same pattern: calibration yields large gains in specificity retention and overall utility at only a marginal cost in precision.

Figure 2: Policy comparison by overcommitment-aware utility on LongFact; calibrated selector outperforms fixed-threshold, deletion-only, and whole-response abstention baselines.
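The comparison in Figure 2 can be recomputed from the full-run utilities quoted above. The short snippet below only ranks the reported numbers and takes differences against the no-control output; the dictionary keys are labels chosen here for convenience.

```python
# Overcommitment-aware utility on the full LongFact run, as reported above.
utility = {
    "calibrated": 0.9130,
    "claim_drop": 0.9019,
    "uncalibrated": 0.8521,
    "no_control": 0.8460,
    "whole_abstain": 0.6072,
}

baseline = utility["no_control"]

# Policies ordered from highest to lowest utility.
ranking = sorted(utility, key=utility.get, reverse=True)

# Absolute gain (or loss) in utility relative to the no-control output.
gains = {name: round(u - baseline, 4) for name, u in utility.items()}

for name in ranking:
    print(f"{name:14s} utility={utility[name]:.4f} gain={gains[name]:+.4f}")
```

The calibrated selector gains 0.0670 over no control, claim dropping gains 0.0559, the uncalibrated selector gains 0.0061, and whole-answer abstention loses 0.2388.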

Theoretical and Practical Implications

Practically, the method supports more robust agentic pipelines: downstream consumers (retrievers, auditors, or pipelined LMs) can act on claims emitted at different levels of granularity, escalating or triggering further search only for claims whose specificity was downgraded. This improves reliability without resorting to overconservative answer suppression.
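One way a downstream consumer might use the emitted specificity level is as a routing signal. The sketch below is hypothetical: the `route` function, the level names, and the choice to escalate omitted claims along with coarsened ones are assumptions made for illustration.

```python
from typing import List, Tuple


def route(decisions: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split claims into those accepted as-is and those to escalate.

    decisions: (claim_text, level) pairs, with level in
    {"fine", "coarse", "omit"}. Claims emitted at the fine level pass
    through; anything downgraded is queued for further search or review.
    """
    accept = [text for text, level in decisions if level == "fine"]
    escalate = [text for text, level in decisions if level != "fine"]
    return accept, escalate


# Invented decisions from an upstream specificity selector.
decisions = [
    ("The bridge is 1,149 m long.", "fine"),
    ("The bridge opened in 1932.", "coarse"),
    ("It was designed by J. Doe.", "omit"),
]
accept, escalate = route(decisions)
```

Only the two downgraded claims reach the escalation queue, so any extra retrieval effort is spent where the evidence was found insufficient.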

Theoretically, this work highlights that uncertainty over output granularity is a meaningful axis of control, distinct from canonical confidence estimation or selective prediction. The current work provides empirical, calibration-based safety constraints, and there is a clear path to integrating stronger statistical validity (e.g., finite-sample or conformal guarantees over structured outputs).

Limitations and Future Directions

Several limitations are acknowledged:

The empirical protocol is tailored to claimwise evaluation, not official leaderboard pipelines like SAFE/F1@K.
Oracle experiments serve only as non-deployable upper bounds.
Performance is bounded by the quality of claim extraction, backoff, and support scoring.
Sequential or interactively monitored agentic settings are left for future work.

Immediate future work involves bolstering support scoring (potentially with robust, distribution-free verifiers), integrating claim-level control in live agent pipelines with active tool use, and formalizing inferential guarantees for compositional specificity selection.

Conclusion

This paper demonstrates that claim-level, calibrated specificity control, implemented as a post-generation uncertainty layer, enables agentic systems to answer only as precisely as justified by evidence, retaining most of the draft's specificity (0.938 on the full LongFact run) while sharply reducing unsupported detail. The results have direct implications for reliable tool use, reporting, complex QA, and any context requiring precise uncertainty semantics. The claimwise perspective outlined here reframes long-form factuality as a control problem, pushing agentic design toward modular, compositional, and uncertainty-aware outputs.
