- The paper introduces calibrated claim-level specificity control to match each claim’s precision with underlying evidence.
- It decomposes responses into atomic claims, using dual support scoring (fine and coarse) calibrated via Clopper-Pearson thresholds.
- Empirical tests on the LongFact benchmark reveal improved utility and specificity retention while significantly reducing unsupported claims.
Calibrated Claim-Level Specificity Control for Agentic Systems
Motivation and Problem Setting
Agentic LLM-based systems exhibit a failure mode characterized not by outright fabrication but by excessive precision: multi-claim outputs in which some parts are stated at a level of specificity that the underlying evidence does not support. Prior approaches, which focus on answer-level abstention or selective prediction, do not address the compositional uncertainty of multi-part responses, where the evidence justifies the granularity of some claims but not of others. The paper "Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems" (2604.17487) formalizes this as an overcommitment control problem: for every claim generated in response to a prompt, determine the highest specificity admissible given the retrieved evidence.
Method: Compositional Selective Specificity and Calibration
The approach decomposes outputs into atomic claims, each paired with a coarse (less-specific) backoff and the null (omission) action. For every claim, support scores are computed for both fine-grained and coarse representations against the evidence. The central mechanism—the calibrated selection policy—employs thresholds on these support scores, chosen through conservative calibration (Clopper-Pearson upper bounds) on a held-out split, to select the most specific but supportable claim form. This design allows the system to locally back off precision only where warranted, instead of globally abstaining from answering or omitting entire claims. The pipeline comprises:
- Draft Generation: Base model produces candidate response.
- Claim Extraction/Backoff: Decomposition of draft into atomic claims, with coarse rewrites.
- Support Scoring: Hybrid LLM/verifier model assigns support scores to fine/coarse versions.
- Claimwise Selection: Thresholded, calibrated policy decides which specificity level to emit.
This workflow produces fine-grained, compositional uncertainty control, targeting exactly those claims where evidence is insufficient.
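The calibration and selection steps described above can be illustrated with a minimal sketch. It assumes the thresholds are calibrated by bounding the error rate among accepted held-out claims. The function names, the `max_error` and `alpha` parameters, and the "smallest qualifying threshold" rule are hypothetical choices for illustration, not the paper's exact procedure. The Clopper-Pearson upper bound is computed by bisection on the binomial CDF so that only the standard library is needed.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, alpha=0.05):
    """One-sided Clopper-Pearson upper bound on an error rate,
    given k observed errors in n trials."""
    if n == 0 or k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: binom_cdf(k, n, p) decreases in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

def calibrate_threshold(scores, labels, max_error=0.1, alpha=0.05):
    """Return the smallest score threshold whose accepted held-out claims
    have a Clopper-Pearson error bound <= max_error (None if none does).
    labels[i] is True when held-out claim i is actually supported.
    Assumption: scanning many thresholds is not corrected for here."""
    best = None
    for t in sorted(set(scores), reverse=True):
        accepted = [y for s, y in zip(scores, labels) if s >= t]
        errors = sum(1 for y in accepted if not y)
        if clopper_pearson_upper(errors, len(accepted), alpha) <= max_error:
            best = t
    return best

def select(fine_score, coarse_score, t_fine, t_coarse):
    """Emit the most specific form whose support score clears its threshold."""
    if t_fine is not None and fine_score >= t_fine:
        return "fine"
    if t_coarse is not None and coarse_score >= t_coarse:
        return "coarse"
    return "omit"
```

In this sketch, `calibrate_threshold` would be run twice on the held-out split, once with fine-claim scores and once with coarse-claim scores, and the two resulting thresholds passed to `select` for each new claim.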
Figure 1: Example of claim-level specificity control, where unsupported claim detail is locally backed off to a valid, evidence-supported, coarser claim.
Baselines and Ablations
The work contrasts the calibrated specificity selector with several alternatives:
- No Control: Retains all original claims at maximal specificity (risk of overcommitment).
- Whole-Answer Abstention: Accept-or-abstain at the response level, discarding all claims if any are unsupported.
- Claim Dropping: Omit unsupported claims but never back off to coarse alternatives.
- Uncalibrated Selector: Fixed, non-adaptive selection thresholds.
- Oracle: Uses gold support labels for upper-bound selection.
This design clarifies the contribution of claim-level, calibrated selection as distinct from deletion or global uncertainty management.
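To make the differences between these policies concrete, the following sketch expresses each one as a function over a list of scored claims. The `Claim` type, the function names, and the threshold arguments are hypothetical. The uncalibrated and calibrated selectors share the same backoff logic and differ only in where the thresholds come from, while the oracle would replace the score comparisons with gold support labels.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    fine: str            # claim at full specificity
    coarse: str          # less specific backoff rewrite
    fine_score: float    # support score of the fine form
    coarse_score: float  # support score of the coarse form

def no_control(claims):
    """Emit every claim at maximal specificity."""
    return [c.fine for c in claims]

def whole_answer_abstention(claims, t):
    """Emit the full response only if every fine claim clears t."""
    if all(c.fine_score >= t for c in claims):
        return [c.fine for c in claims]
    return []

def claim_dropping(claims, t):
    """Drop fine claims below t; never back off to the coarse form."""
    return [c.fine for c in claims if c.fine_score >= t]

def specificity_backoff(claims, t_fine, t_coarse):
    """Per claim: fine if supported, else coarse if supported, else omit.
    Fixed thresholds give the uncalibrated selector; thresholds chosen
    on a held-out split give the calibrated one."""
    out = []
    for c in claims:
        if c.fine_score >= t_fine:
            out.append(c.fine)
        elif c.coarse_score >= t_coarse:
            out.append(c.coarse)
    return out
```

For a response with one well-supported claim and one claim supported only in its coarse form, whole-answer abstention returns nothing, claim dropping returns one claim, and the backoff selector returns both, the second in its coarse form.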
Empirical Results
Experiments are conducted on the LongFact benchmark (2,280 prompts, >11,000 claims) under a claimwise evaluation protocol, supplemented by pilots on LongFact and HotpotQA subsets (evaluated with GPT-5.4 and Claude Sonnet 4.6). The principal metric is overcommitment-aware utility, which rewards supported specificity, penalizes unsupported emissions, and accounts for retention at each specificity level.
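This summary does not state the exact formula for overcommitment-aware utility, so the following is only an illustrative sketch of a metric with the three properties named above. The weights `coarse_credit` and `penalty`, the per-claim averaging, and the function name are all assumptions, not the paper's definition.

```python
def overcommitment_aware_utility(decisions, fine_ok, coarse_ok,
                                 coarse_credit=0.5, penalty=1.0):
    """Average per-claim utility under hypothetical weights.

    decisions: "fine", "coarse", or "omit" for each claim
    fine_ok / coarse_ok: whether each form is actually supported
    """
    if not decisions:
        return 0.0
    total = 0.0
    for decision, f_ok, c_ok in zip(decisions, fine_ok, coarse_ok):
        if decision == "fine":
            # Full credit for supported detail, penalty for overcommitment.
            total += 1.0 if f_ok else -penalty
        elif decision == "coarse":
            # Partial credit: supported, but some specificity was given up.
            total += coarse_credit if c_ok else -penalty
        # "omit" contributes nothing: no risk, but no retained content.
    return total / len(decisions)
```

Under these assumed weights, a policy that emits everything at full specificity is penalized for each unsupported claim, while a policy that omits too much forgoes the credit it could have earned from supported fine or coarse forms.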
The calibrated selector delivers:
- Full LongFact: Overcommitment-aware utility improves from 0.8460 (no control) to 0.9130 (calibrated), with 0.938 retained specificity and 0.9865 support precision—maintaining nearly all supported content while sharply reducing unsupported claims.
- Whole-abstain baseline: Achieves high precision (0.9825) but at the cost of large information loss (0.6292 specificity retention, utility 0.6072).
- Claim-drop: High precision and utility (0.9877 and 0.9019, respectively), but because it never backs off, it loses claims that are supported only in their coarse form, which the calibrated selector retains.
- Uncalibrated: Extremely conservative (0.9934 precision) but discards more valid content (utility 0.8521).
The pilots show the same pattern: calibration delivers large increases in specificity retention and overall utility with only a marginal loss of precision.
Figure 2: Policy comparison by overcommitment-aware utility on LongFact; calibrated selector outperforms fixed-threshold, deletion-only, and whole-response abstention baselines.
Theoretical and Practical Implications
Practically, the method supports more robust agentic reasoning: downstream consumers (retrievers, auditors, or pipelined LMs) can act on claims of differing granularity, escalating or triggering further search only for those claims whose specificity was downgraded. This improves reliability without resorting to overconservative answer suppression.
Theoretically, this work highlights that uncertainty over output granularity is a meaningful axis of control, distinct from canonical confidence estimation or selective prediction. The current work provides empirical, calibration-based safety constraints, but there is a clear path to integrating stronger statistical validity (e.g., full-sample or conformal guarantees over structured outputs).
Limitations and Future Directions
Several limitations are acknowledged:
- The empirical protocol is tailored to claimwise evaluation, not official leaderboard pipelines like SAFE/F1@K.
- Oracle experiments serve only as non-deployable upper bounds.
- Performance is bounded by the quality of claim extraction, backoff, and support scoring.
- Sequential or interactively monitored agentic settings are left for future work.
Immediate future work involves bolstering support scoring (potentially with robust, distribution-free verifiers), integrating claim-level control in live agent pipelines with active tool use, and formalizing inferential guarantees for compositional specificity selection.
Conclusion
This paper demonstrates that claim-level, calibrated specificity control, implemented as a post-generation uncertainty layer, enables agentic systems to answer only as precisely as the evidence justifies, retaining most supported content while sharply reducing unsupported detail. The results have direct implications for reliable tool use, reporting, complex QA, and any context requiring precise uncertainty semantics. The claimwise perspective outlined here reframes long-form factuality as a control problem, pushing agentic design toward modular, compositional, and uncertainty-aware outputs.