Sub-claim Extraction: Methods and Evaluation
- Sub-claim extraction is the process of decomposing complex claims into minimal, verifiable propositions that enable targeted fact checking.
- It employs methods like prompt-based LLM decomposition, RL optimization, and modular filtering to ensure atomicity, coverage, and reliability.
- Evaluation frameworks use metrics such as coherence, faithfulness, and downstream gains to assess the quality and impact of sub-claim extraction.
Sub-claim extraction, also referred to as claim decomposition, is the process of mapping a complex, potentially multi-faceted claim or natural language sentence into a finite set of atomic sub-claims or propositions. Each sub-claim should represent a minimal, self-contained assertion that can be independently verified or refuted against evidence. This decomposition is central to fact verification, adversarial claim checking, and fine-grained factuality evaluation, as it allows targeted validation of each aspect of a statement and robust error localization. The choice of decomposition strategy (methodology, automation, and quality controls) directly affects the reliability, interpretability, and robustness of downstream scoring or verification systems.
1. Formal Definitions and Decomposition Frameworks
Sub-claim extraction formalizes complex claim analysis as follows: given an input claim $c$, produce a set of atomic sub-claims $S = \{s_1, \dots, s_n\}$, with each $s_i$ constructed to be independently interpretable and directly verifiable. In structured claim verification systems, each $s_i$ is paired with a subset of evidence $E_i$ and receives a veracity label $y_i$ (True/False/Unverified), with an aggregation function $f$ yielding the final claim-level verdict $y$:
$$y = f(y_1, \dots, y_n)$$
Alignment of evidence to sub-claims is an essential component; evidence can be associated at the claim level (SRE: repeated claim-level evidence) or explicitly aligned at the sub-claim level (SAE: sub-claim aligned evidence), as shown in (Akhter et al., 11 Feb 2026).
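As an illustration of the formalization above, the following minimal sketch pairs each sub-claim with its own evidence and label and then aggregates the labels into a claim-level verdict. The class names and the conservative aggregation rule (any False refutes, all True supports, otherwise Unverified) are assumptions made for the example; the cited systems may aggregate differently.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Label(Enum):
    TRUE = "True"
    FALSE = "False"
    UNVERIFIED = "Unverified"


@dataclass
class SubClaim:
    text: str
    label: Label
    # Evidence aligned to this sub-claim only (the SAE setting).
    evidence: List[str] = field(default_factory=list)


def aggregate(sub_claims: List[SubClaim]) -> Label:
    """Hypothetical aggregation rule, chosen for illustration only."""
    labels = [s.label for s in sub_claims]
    if not labels:
        return Label.UNVERIFIED
    if Label.FALSE in labels:
        return Label.FALSE          # one refuted sub-claim refutes the claim
    if all(lab is Label.TRUE for lab in labels):
        return Label.TRUE           # every sub-claim must be supported
    return Label.UNVERIFIED         # otherwise abstain
```

Under this rule a single Unverified sub-claim keeps the whole claim Unverified, which makes the verdict sensitive to how finely the claim is split.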
Sub-claim boundaries are typically defined as contiguous, short sentences or clauses, each corresponding to a single proposition. The guiding principle (e.g., (Wanner et al., 2024)) is to maximize atomicity—one fact per sub-claim—while maintaining full coverage of the original claim.
Hierarchical and aspect-based models (e.g., ClaimSpect (Kargupta et al., 12 Jun 2025)) generalize this concept, allowing extraction of sub-claim trees with arbitrarily deep levels, where each node is an aspect or finer-grained sub-aspect.
2. Annotation Protocols, Datasets, and Sub-claim Characterization
Annotation schemas for sub-claims emphasize atomicity, standalone verifiability, and coverage. For example, in (Akhter et al., 11 Feb 2026), annotation is carried out by expert annotators, who extract and label each sub-claim, ensuring it is check-worthy and does not rely on implicit context. Each sub-claim receives a veracity label and is mapped to a minimal set of supporting or opposing evidence.
Dataset construction details include:
- PHEMEPlus (Akhter et al., 11 Feb 2026): 399 complex claims, 1,169 sub-claims (≈2.93 per claim), expert-labeled with strict temporal evidence alignment.
- ClaimDecomp (Chen et al., 2022): 1,494 political claims, 6,555 annotated subquestions (≈2.7 per claim), distinguishing literal and implied subquestions.
- Expanded biography/knowledge base domains (“Core” (Jiang et al., 2024); “A Closer Look…” (Wanner et al., 2024)).
Agreement metrics include Bennett’s S (0.81 in (Akhter et al., 11 Feb 2026)), Fleiss’ κ (0.52 in (Chen et al., 2022)), and sentence-level overlap via BLEU/BERTScore. Comprehensive coverage of both explicit and implicit aspects is emphasized, and annotation guidelines recommend atomic, non-overlapping, linguistically coherent formulations.
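For reference, Bennett's S corrects observed agreement for chance under the assumption that all categories are equally likely. With $P_o$ the observed agreement and $k$ the number of label categories, $S = (k P_o - 1)/(k - 1)$. The function below is a small sketch of that formula; the function name and the example numbers are illustrative and are not taken from the cited papers.

```python
def bennett_s(observed_agreement: float, num_categories: int) -> float:
    """Bennett's S: chance-corrected agreement assuming uniform categories."""
    if num_categories < 2:
        raise ValueError("need at least two categories")
    if not 0.0 <= observed_agreement <= 1.0:
        raise ValueError("observed agreement must lie in [0, 1]")
    k = num_categories
    return (k * observed_agreement - 1.0) / (k - 1.0)
```

With three labels (for instance True/False/Unverified), an observed agreement of 0.9 gives S = 0.85, and agreement at the chance level of 1/3 gives S = 0.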
3. Methodologies for Sub-claim Extraction
Prompt-based LLM Decomposition
The dominant paradigm involves LLM prompting, with templates constructed to elicit highly atomic, self-contained sub-claims from complex input (Akhter et al., 11 Feb 2026, Wanner et al., 2024, Gong et al., 2024, Liu et al., 5 Jun 2025, Chen et al., 2022). Extraction is typically zero-shot or in-context, optionally enhanced with post-processing for pronoun resolution and format normalization.
Example methodology (Gong et al., 2024):
- Prepare in-context examples: demonstrate atomic splitting, pronoun removal, preservation of qualifiers.
- Prompt LLM with instruction to dissect into numbered, standalone statements.
- Parse model outputs into a list of sub-claims.
Pseudo-algorithm (EACon (Gong et al., 2024)): [listing not preserved]
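The three steps listed above (prepare examples, prompt, parse) can be sketched as follows. This is not the EACon listing: the instruction wording, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions introduced for illustration.

```python
import re
from typing import Callable, List, Tuple

# (claim, sub-claims) pairs used as in-context demonstrations.
Example = Tuple[str, List[str]]

# Hypothetical instruction text, not quoted from any paper.
INSTRUCTION = (
    "Break the claim into numbered, standalone statements. "
    "Replace pronouns with their referents and keep all qualifiers."
)

# Matches lines such as "1. text" or "2) text".
_NUMBERED = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def build_prompt(claim: str, examples: List[Example]) -> str:
    """Steps 1-2: in-context examples followed by the target claim."""
    parts = [INSTRUCTION, ""]
    for ex_claim, ex_subs in examples:
        parts.append(f"Claim: {ex_claim}")
        parts.extend(f"{i}. {s}" for i, s in enumerate(ex_subs, start=1))
        parts.append("")
    parts.append(f"Claim: {claim}")
    return "\n".join(parts)


def parse_sub_claims(output: str) -> List[str]:
    """Step 3: keep only numbered lines, dropping any surrounding chatter."""
    subs = []
    for line in output.splitlines():
        match = _NUMBERED.match(line)
        if match:
            subs.append(match.group(1))
    return subs


def decompose(claim: str, examples: List[Example],
              llm: Callable[[str], str]) -> List[str]:
    return parse_sub_claims(llm(build_prompt(claim, examples)))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client and makes the parsing step easy to test with a stub.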
Structured Reasoning and RL Optimization
Reinforcement learning frameworks, such as Distill-and-Align Decomposition (DAD (Magomere et al., 25 Feb 2026)), combine supervised fine-tuning on teacher-distilled decompositions with policy optimization for multi-objective rewards. The DAD reward incorporates: format compliance, alignment with verifier preferences, and decomposition quality as judged by an LLM-based rubric. Sequential prompts enforce claim detection, decontextualization, relationship identification, and atomic extraction.
GRPO (Group Relative Policy Optimization) (Magomere et al., 25 Feb 2026) provides variance reduction for policy updates, allowing simultaneous optimization across formatting, faithfulness, and task performance.
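To make the group-relative idea concrete, the sketch below scores a group of decompositions sampled for the same claim and normalizes each reward against the group mean and standard deviation, which is the usual GRPO advantage. The weighted-sum reward and its weights are assumptions for illustration; the source states which components the DAD reward includes but not how they are combined.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def combined_reward(format_ok: float, verifier_alignment: float,
                    quality: float,
                    weights: Tuple[float, float, float] = (0.2, 0.4, 0.4)) -> float:
    """Hypothetical weighted sum of the three reward components."""
    w_f, w_v, w_q = weights
    return w_f * format_ok + w_v * verifier_alignment + w_q * quality


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to its own sampling group.

    Uses the population standard deviation; implementations differ on
    this detail, so treat it as a choice made for the example.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group mean, no separate value network is needed, and a group in which every sample earns the same reward contributes zero advantage.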
Modular and Semantic Filtering
Core (Jiang et al., 2024) augments any decomposition procedure with a selection mechanism that scores sub-claims for faithfulness (entailment by context), informativeness (conditional surprisal via UNLI), and non-redundancy (textual entailment between sub-claims). An integer linear program selects a maximally informative, non-overlapping subset subject to a faithfulness threshold.
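The selection step can be illustrated with a small stand-in. The sketch below does not use an integer linear program; it enumerates subsets by brute force, which yields the same optimum on small inputs but scales exponentially. The scores are assumed to come from external models (entailment for faithfulness, a surprisal-style score for informativeness), and the function name, input layout, and threshold are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def select_sub_claims(informativeness: List[float],
                      faithfulness: List[float],
                      redundant_pairs: Set[FrozenSet[int]],
                      threshold: float = 0.5) -> List[int]:
    """Return indices of the most informative non-redundant faithful subset."""
    # Faithfulness acts as a hard filter before selection.
    eligible = [i for i, f in enumerate(faithfulness) if f >= threshold]
    best: List[int] = []
    best_score = 0.0
    for size in range(1, len(eligible) + 1):
        for subset in combinations(eligible, size):
            # Skip any subset containing two sub-claims marked as redundant.
            if any(frozenset(pair) in redundant_pairs
                   for pair in combinations(subset, 2)):
                continue
            score = sum(informativeness[i] for i in subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best
```

For realistic numbers of sub-claims an ILP or other combinatorial solver would replace the enumeration while keeping the same objective and constraints.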
Hierarchical/Aspect Extraction
ClaimSpect (Kargupta et al., 12 Jun 2025) recursively applies an LLM and a discriminative retrieval module to construct a tree of aspects and sub-aspects, each node annotated with its own keywords and retrieved supporting segments. Discriminativeness scores guide the selection of corpus evidence most representative of a specific aspect versus distractors.
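A tree of aspects like the one described above can be represented with a simple recursive structure. The sketch below is a generic illustration, not the ClaimSpect implementation: `propose` stands in for the LLM step that suggests sub-aspects with keywords, `retrieve` stands in for the retrieval module, and both are hypothetical callables supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class AspectNode:
    aspect: str
    keywords: List[str] = field(default_factory=list)
    segments: List[str] = field(default_factory=list)   # retrieved support
    children: List["AspectNode"] = field(default_factory=list)


# propose(aspect, segments) -> [(sub_aspect, keywords), ...]
Propose = Callable[[str, List[str]], List[Tuple[str, List[str]]]]
# retrieve(aspect) -> supporting corpus segments
Retrieve = Callable[[str], List[str]]


def build_tree(aspect: str, propose: Propose, retrieve: Retrieve,
               max_depth: int,
               keywords: Optional[List[str]] = None) -> AspectNode:
    """Recursively expand an aspect until max_depth levels are added."""
    node = AspectNode(aspect, list(keywords or []), retrieve(aspect))
    if max_depth > 0:
        for sub_aspect, sub_keywords in propose(aspect, node.segments):
            node.children.append(
                build_tree(sub_aspect, propose, retrieve,
                           max_depth - 1, sub_keywords))
    return node


def leaves(node: AspectNode) -> List[str]:
    """Collect the finest-grained aspects in left-to-right order."""
    if not node.children:
        return [node.aspect]
    result: List[str] = []
    for child in node.children:
        result.extend(leaves(child))
    return result
```

The depth limit is a design choice for the sketch; a discriminativeness score, as described above, could instead decide which retrieved segments are kept at each node.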
Shallow/Parse-based Methods
Earlier approaches involve dependency parsing (PredPatt), semantic role labeling, or shallow segmenters with later LLM re-writing. However, these approaches show lower sub-claim cohesion, atomicity, or coverage compared to in-context LLM decomposition (Wanner et al., 2024).
4. Evaluation Methodology, Metrics, and Results
Unlike typical extraction tasks, explicit extraction F1/precision/recall is rarely measured; sub-claim extraction is treated either as an oracle (fixed step), or its quality is assessed by impact on downstream tasks (Akhter et al., 11 Feb 2026, Gong et al., 2024, Liu et al., 5 Jun 2025).
Key evaluation metrics include:
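- DecompScore (Wanner et al., 2024): Measures the number and coherence of sub-claims entailed by the original claim. With $S(c)$ the set of sub-claims extracted from claim $c$ in an evaluation set $C$:
$$\mathrm{DecompScore} = \frac{1}{|C|} \sum_{c \in C} \sum_{s \in S(c)} \mathbb{1}[c \models s]$$
This quantifies both atomicity and coverage.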
- FActScore (Wanner et al., 2024): Fraction of sub-claims supported by external evidence.
- Human/Automatic Recovery (Chen et al., 2022): Human-matched recall rate for subquestions (e.g., 0.74 for literal, 0.18 for implied with T5-3B model).
- Downstream Gains: Removal of sub-claim extraction drops claim verification F1 by 4–7 points on standard datasets (Gong et al., 2024, Liu et al., 5 Jun 2025).
- Robustness: Core filtering eliminates inflation of factual precision by trivial or repetitive sub-claims, restoring robustness to adversarial generations (Jiang et al., 2024).
- Human Judgments: Structured decomposition (DAD (Magomere et al., 25 Feb 2026)) achieves high marks (>0.75) on verifiability, coherence, clarity, and uniqueness.
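The two count-based metrics in the list above can be sketched in a few lines. The entailment and support judgments are passed in as callables because in practice they come from an NLI model or a retrieval-backed verifier; the function names and the per-claim averaging used for the DecompScore-style count are assumptions made for this example.

```python
from typing import Callable, Dict, List


def decomp_score(decompositions: Dict[str, List[str]],
                 entails: Callable[[str, str], bool]) -> float:
    """Average number of sub-claims entailed by their source claim."""
    if not decompositions:
        return 0.0
    counts = [sum(1 for s in subs if entails(claim, s))
              for claim, subs in decompositions.items()]
    return sum(counts) / len(counts)


def fact_score(sub_claims: List[str],
               is_supported: Callable[[str], bool]) -> float:
    """Fraction of sub-claims supported by external evidence."""
    if not sub_claims:
        return 0.0
    return sum(1 for s in sub_claims if is_supported(s)) / len(sub_claims)
```

The first function rewards decompositions that yield many entailed sub-claims, while the second is a precision-style ratio, which is why repeated or trivial sub-claims can inflate it unless they are filtered first.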
5. Error Analysis, Robustness, and Limitations
Explicit error analysis is often focused not on the extraction spans themselves but on the propagation of label errors and alignment mismatches to downstream metrics (Akhter et al., 11 Feb 2026).
Known failure modes include:
- Granularity Mismatch: Too coarse or too fine granularity can harm verification accuracy (Magomere et al., 25 Feb 2026, Liu et al., 5 Jun 2025).
- Implicit Fact Omission: Standard prompts often miss implied or contextually licensed sub-claims (Chen et al., 2022, Wanner et al., 2024).
- Hallucinations: LLMs may invent unsupported sub-claims, especially without coverage checks or strict coherence validation (Wanner et al., 2024, Liu et al., 5 Jun 2025).
- Redundancy: Repetitive or paraphrased sub-claims can game factual precision metrics unless filtered (Jiang et al., 2024).
Robustness strategies:
- Coverage + Coherence Filtering (Wanner et al., 2024, Jiang et al., 2024): Use validators to remove unsupported sub-claims.
- Abstention in Labeling (Akhter et al., 11 Feb 2026): Aggressive forced prediction of T/F labels propagates more error than conservative "unknown" predictions.
Limitations:
- No standard, reliable reference metric for extraction quality beyond DecompScore.
- Extraction models are primarily tested in English and domain-general or entity-centric settings.
- Many pipelines require extensive in-context example engineering or hand-crafted rules for maximal performance.
6. Practical Recommendations and Future Directions
Empirical findings suggest several best practices:
- Prioritize atomicity and coverage: Use decomposition prompts or retrieval-augmented strategies that enforce splitting on conjunctions, explicit qualifiers, and all relevant event arguments (Wanner et al., 2024).
- Ground decomposition with retrieval: Iterative, discriminative retrieval (ClaimSpect) helps align aspect trees to corpus evidence and suppress hallucinations (Kargupta et al., 12 Jun 2025).
- Filter and select: Deploy selection layers (Core) to enforce informativeness and eliminate redundancy before verifying sub-claims (Jiang et al., 2024).
- Optimize jointly: Align extraction and verifier behavior via RL or other feedback approaches to tune sub-claim granularity (Magomere et al., 25 Feb 2026).
- Validate extraction steps: Use LLM validators or dedicated coverage checks to ensure both completeness and faithfulness (Wanner et al., 2024, Jiang et al., 2024).
Future work includes identifying more robust evaluation metrics, extending extraction methods and metrics to multilingual and multi-modal settings, integrating domain ontology constraints, learning adaptive or context-sensitive granularity, and exploring truly end-to-end decomposer-verifier-retriever pipelines (Kargupta et al., 12 Jun 2025, Magomere et al., 25 Feb 2026, Akhter et al., 11 Feb 2026).