Document-Level Claim Extraction
- Document-level claim extraction is the process of mapping full documents to atomic, decontextualized, and check-worthy factual claims for robust fact-checking.
- It integrates extractive summarization with entailment filtering and QA-based decontextualization to resolve context dependency and enhance semantic clarity.
- Recent research demonstrates improved extraction accuracy by leveraging LLM-based evaluations, penalty calibration, and advanced alignment metrics.
Document-level claim extraction is a core task within automated fact-checking pipelines, requiring the identification and interpretation of check-worthy factual claims from an entire document rather than from isolated sentences. The task requires extracting claims that are atomic, decontextualized, and worth verifying, which poses challenges related to language variety, context dependence, and evaluation. Recent research formalizes and advances document-level claim extraction by integrating extractive summarization, decontextualization techniques, and optimal alignment procedures, offering improved benchmarks for semantic fidelity and downstream utility (Deng et al., 5 Jun 2024, Makaiová et al., 18 Nov 2025).
1. Task Definition and Theoretical Foundations
Document-level claim extraction (CE) involves mapping a document $D$, such as a news article, comment, or legal ruling, onto a set of factual claims $C = \{c_1, \dots, c_n\}$, with each $c_i$ satisfying three properties (Makaiová et al., 18 Nov 2025):
- Atomicity: Each claim expresses a single factual proposition.
- Decontextualization: Each claim is interpretable outside the original document, resolving pronouns, ellipses, and presupposed context.
- Check-worthiness: Each claim is of public interest and merits verification.
The extraction process must address ambiguity, redundancy, and the challenge of distinguishing semantic content from linguistic surface forms, especially in morphologically rich or informal language domains. Decontextualization is particularly emphasized to ensure that claims are actionable by both human and automated verifiers without dependency on the original context (Deng et al., 5 Jun 2024).
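These three properties lend themselves to explicit bookkeeping during extraction and annotation. Below is a minimal sketch of such a claim record in Python; the class and field names are illustrative choices and do not come from either cited work.

```python
from dataclasses import dataclass

@dataclass
class ExtractedClaim:
    text: str                # claim rewritten to stand alone
    source_doc_id: str       # identifier of the source document
    atomic: bool             # expresses exactly one factual proposition
    decontextualized: bool   # interpretable without the source document
    check_worthy: bool       # of public interest and worth verifying
```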
2. Pipeline Architectures and Extraction Methodologies
Two main classes of extraction pipelines are documented:
a. Extractive Summarization with Entailment Filtering
One approach treats claim extraction as the identification of central, non-redundant sentences via an extractive summarizer (e.g., BertSum), followed by redundancy reduction through a document-level entailment model (DocNLI) (Deng et al., 5 Jun 2024). For a document $D = (s_1, \dots, s_m)$, the process is:
- Sentence Scoring: Each sentence $s_i$ receives a scalar centrality score $\hat{y}_i = \sigma(W h_i + b)$, where $h_i$ is the [CLS]-token embedding of $s_i$.
- Selection and Non-Entailment Filtering: The top-$k$ sentences by score are iteratively filtered to exclude those entailed by earlier selections, i.e., $s_i$ is kept only if no previously selected $s_j$ satisfies $\mathrm{Entail}(s_j, s_i)$, producing a concise, non-redundant claim candidate set $\hat{C}$; a code sketch of this stage follows.
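The sketch below assumes centrality scores are supplied by any extractive summarizer (BertSum in the cited pipeline) and uses the public `roberta-large-mnli` checkpoint as a stand-in for the document-level DocNLI model; the top-$k$ value and entailment threshold are illustrative.

```python
# Select-then-filter sketch: rank sentences by centrality, then drop any sentence
# entailed by an earlier selection. roberta-large-mnli stands in for DocNLI.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "roberta-large-mnli"  # stand-in for DocNLI (assumption)
tok = AutoTokenizer.from_pretrained(NLI_NAME)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_NAME).eval()

def entails(premise: str, hypothesis: str, threshold: float = 0.9) -> bool:
    """True if the premise entails the hypothesis with probability >= threshold."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item() >= threshold  # index 2 = "entailment" for this checkpoint

def select_claim_candidates(sentences: list[str], scores: list[float], k: int = 5) -> list[str]:
    """Keep the top-k most central sentences, skipping those entailed by earlier picks."""
    ranked = [s for s, _ in sorted(zip(sentences, scores), key=lambda p: -p[1])][:k]
    selected: list[str] = []
    for sent in ranked:
        if not any(entails(prev, sent) for prev in selected):
            selected.append(sent)
    return selected
```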
b. Decontextualisation via Question-Answering and Generation
To produce stand-alone (decontextualized) claims, each candidate $c$ undergoes explicit context addition. This is operationalized by:
- Extracting potentially ambiguous units (entities, pronouns) from $c$.
- Generating a question for each unit.
- Answering each question with retrieval (BM25) over the source document and a QA model (UnifiedQA).
- Converting the resulting QA pairs into declarative context sentences $x_{\mathrm{ctx}}$ using a model such as BART fine-tuned on QA2D.
- Concatenating the extra context with the original claim and rewriting through a decontextualisation model, $c' = \mathrm{Decontext}([x_{\mathrm{ctx}}; c])$, where $c'$ is the resulting decontextualized claim (Deng et al., 5 Jun 2024); these steps are composed in the sketch below.
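Because each step corresponds to a separate pretrained module, the components in this sketch are passed in as callables; the parameter names are illustrative, not taken from the paper's implementation.

```python
from typing import Callable, Iterable

def decontextualize(
    claim: str,
    document: str,
    extract_units: Callable[[str], Iterable[str]],   # ambiguous entities/pronouns in the claim
    generate_question: Callable[[str, str], str],    # (claim, unit) -> clarifying question
    retrieve: Callable[[str, str], list[str]],       # BM25-style retrieval over the document
    answer: Callable[[str, list[str]], str],         # UnifiedQA-style question answering
    qa_to_declarative: Callable[[str, str], str],    # QA2D-style (question, answer) -> sentence
    rewrite: Callable[[str, str], str],              # (context, claim) -> stand-alone claim
) -> str:
    """Compose the QA-based decontextualisation steps into a single pass."""
    context_sentences = []
    for unit in extract_units(claim):
        question = generate_question(claim, unit)
        passages = retrieve(question, document)
        answer_text = answer(question, passages)
        context_sentences.append(qa_to_declarative(question, answer_text))
    # Concatenate the generated context with the claim and rewrite it so the
    # claim is interpretable outside the source document.
    return rewrite(" ".join(context_sentences), claim)
```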
3. Alignment and Evaluation Metrics
Evaluating document-level claim extraction requires comparing sets of extracted claims to human-annotated references, which is formalized as a set alignment problem (Makaiová et al., 18 Nov 2025). Given a system claim set $S$, a reference claim set $R$, and a similarity function $\mathrm{sim}(s, r)$, the optimal matching $M \subseteq S \times R$ (with each claim in at most one pair) maximizes $\sum_{(s, r) \in M} \mathrm{sim}(s, r)$.
Unmatched claims are penalized, enforcing constraints on both over- and under-generation.
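A minimal sketch of this alignment objective, using the Hungarian algorithm from SciPy; the similarity function is pluggable (edit similarity, BERTScore, or an LLM judge), and the fixed penalty for unmatched claims is a placeholder for whatever calibration the evaluation adopts.

```python
from typing import Callable

import numpy as np
from scipy.optimize import linear_sum_assignment

def alignment_score(
    system: list[str],
    reference: list[str],
    sim: Callable[[str, str], float],  # pairwise similarity in [0, 1]
    penalty: float = 0.5,              # cost per unmatched claim (illustrative value)
) -> float:
    """Maximum-weight one-to-one matching score, minus penalties for unmatched claims."""
    if not system or not reference:
        return -penalty * (len(system) + len(reference))
    # Negate similarities so the assignment solver's cost minimization maximizes similarity.
    cost = np.array([[-sim(s, r) for r in reference] for s in system])
    rows, cols = linear_sum_assignment(cost)
    matched = -cost[rows, cols].sum()
    # Claims left out of the one-to-one matching count as over- or under-generation.
    unmatched = (len(system) - len(rows)) + (len(reference) - len(cols))
    return matched - penalty * unmatched
```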
| Metric | Principle | Notable Limitations |
|---|---|---|
| Edit Similarity | Levenshtein-based | Fails on paraphrase, sensitive to code-mixing and inflectional variability |
| BERTScore | Embedding similarity | Handles paraphrase but insensitive to negation flips |
| LLM-as-Judge (logit-based) | LLM semantic scoring | Highest human correlation, but does not explicitly check atomicity, decontextualization, or check-worthiness |
LLM-based scores (prompting Phi-4/Qwen2.5 and aggregating logits) demonstrate highest alignment with human preferences and outperform surface metrics in morphologically rich and informal languages (Makaiová et al., 18 Nov 2025).
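A sketch of the logit-based judging scheme: prompt an instruction-tuned model to decide whether two claims state the same fact, then read off the probability mass placed on the "yes" versus "no" continuations. The prompt wording and the `Qwen/Qwen2.5-7B-Instruct` checkpoint are assumptions, not the exact setup of the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint; the paper prompts Phi-4/Qwen2.5
tok = AutoTokenizer.from_pretrained(JUDGE_NAME)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_NAME, torch_dtype=torch.bfloat16).eval()

def judge_claim_match(system_claim: str, reference_claim: str) -> float:
    """Return P('yes') among {'yes', 'no'} as a semantic-match score in [0, 1]."""
    prompt = (
        "Do the following two claims state the same fact? Answer yes or no.\n"
        f"Claim 1: {system_claim}\nClaim 2: {reference_claim}\nAnswer:"
    )
    input_ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = judge(input_ids).logits[0, -1]  # next-token distribution
    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    # Normalize the two answer logits into a single score.
    return torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0].item()
```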
4. Datasets and Domain-specific Challenges
Recent datasets designed for this task include (Makaiová et al., 18 Nov 2025):
- CGCD (Czech Gold Claims Dataset): 60 Czech news comments, 2.17 claims/document, single annotator.
- CSAD (Czech-Slovak Agreement Dataset): 120 comments (mixed Czech/Slovak), 3.12 claims/document, multiple annotators.
Domain challenges include informal register, high surface/morphological variability, frequent code-mixing, and subtle negation phenomena that degrade the utility of string or embedding-based evaluation. The AVeriTeC-DCE dataset (Deng et al., 5 Jun 2024) includes 1,231 document–claim pairs with a median of 17 tokens per claim, focusing on central sentences and their revised stand-alone forms.
5. Experimental Findings and Method Comparisons
State-of-the-art pipelines utilize LLMs (GPT-4o mini, GPT-4.1, Claude Sonnet 3.7) for step-wise claim identification and check-worthiness filtering. Key quantitative results are:
- Score Performance: GPT-4o mini achieves the highest BERTScore (60.2 on CGCD), corroborated by LLM-judge scores (45.3 with Phi-4); Claude Sonnet 3.7 is competitive.
- Claim Count Control: Informed (count-constrained) extraction boosts alignment scores (e.g., Sonnet 3.7 BERTScore improves from 52.8 to 83.4).
- Penalty Calibration: Over- and under-generation penalties strongly affect metric rankings; correct claim count estimation is critical.
- Human Alignment: LLM-as-judge metrics show best agreement with human assessment, especially in informed settings (Makaiová et al., 18 Nov 2025).
In decontextualisation-specific evaluations (Deng et al., 5 Jun 2024), QA-based rewriting improved evidence retrieval and human interpretability, especially for complex phenomena (coreference, scoping, bridging anaphora).
6. Limitations and Prospects for Advancement
- Metric Limitations: Current LLM-based evaluation captures semantic overlap but does not independently verify atomicity, decontextualization, or check-worthiness of output claims. Proposals include integrating explicit subchecks for these properties.
- Dataset Scale: Existing resources are small; broader validation is required for robustness.
- Penalty Terms: Tuning or learning penalty costs for spurious/missing claims is an open area.
- Decontextualisation Coverage: Only a minority of extracted claims are amenable to decontextualisation, limiting method completeness in some domains (Deng et al., 5 Jun 2024).
- Component Dependence: Heavy reliance on pretrained modules (summarization, NLI, QA, decontextualisation, LLMs) may constrain domain transferability and robustness.
Future directions include unified training strategies for joint claim extraction and decontextualisation, the integration of multimodal cues, and the development of evaluation pipelines that explicitly model claim properties beyond semantic similarity, potentially leveraging rich human-judgment data for calibration (Makaiová et al., 18 Nov 2025, Deng et al., 5 Jun 2024).