Document-Level Claim Extraction
- Document-level claim extraction is the process of mapping full documents to atomic, decontextualized, and check-worthy factual claims for robust fact-checking.
- It integrates extractive summarization with entailment filtering and QA-based decontextualization to resolve context dependency and enhance semantic clarity.
- Recent research demonstrates improved extraction accuracy by leveraging LLM-based evaluations, penalty calibration, and advanced alignment metrics.
Document-level claim extraction is a core task within automated fact-checking pipelines, requiring the identification and interpretation of check-worthy factual claims from an entire document rather than from isolated sentences. The task requires extracting claims that are atomic, decontextualized, and worth verifying, which poses challenges related to language variety, context dependence, and evaluation. Recent research formalizes and advances document-level claim extraction by integrating extractive summarization, decontextualization techniques, and optimal alignment procedures, offering improved benchmarks for semantic fidelity and downstream utility (Deng et al., 5 Jun 2024, Makaiová et al., 18 Nov 2025).
1. Task Definition and Theoretical Foundations
Document-level claim extraction (CE) involves mapping a document $D$, such as a news article, comment, or legal ruling, onto a set of factual claims $C = \{c_1, \dots, c_n\}$, with each $c_i$ satisfying three properties (Makaiová et al., 18 Nov 2025):
- Atomicity: Each claim expresses a single factual proposition.
- Decontextualization: Each claim is interpretable outside the original document, resolving pronouns, ellipses, and presupposed context.
- Check-worthiness: Each claim is of public interest and merits verification.
The extraction process must address ambiguity, redundancy, and the challenge of distinguishing semantic content from linguistic surface forms, especially in morphologically rich or informal language domains. Decontextualization is particularly emphasized to ensure that claims are actionable by both human and automated verifiers without dependency on the original context (Deng et al., 5 Jun 2024).
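These three properties lend themselves to explicit bookkeeping during extraction and annotation. Below is a minimal sketch of such a claim record in Python; the class and field names are illustrative choices and do not come from either cited work.

```python
from dataclasses import dataclass

@dataclass
class ExtractedClaim:
    text: str                # claim rewritten to stand alone
    source_doc_id: str       # identifier of the source document
    atomic: bool             # expresses exactly one factual proposition
    decontextualized: bool   # interpretable without the source document
    check_worthy: bool       # of public interest and worth verifying
```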
2. Pipeline Architectures and Extraction Methodologies
Two main classes of extraction pipelines are documented:
a. Extractive Summarization with Entailment Filtering
One approach treats claim extraction as the identification of central, non-redundant sentences via an extractive summarizer (e.g., BertSum), followed by redundancy reduction through a document-level entailment model (DocNLI) (Deng et al., 5 Jun 2024). For a document $D = (s_1, \dots, s_m)$, the process is:
- Sentence Scoring: Each sentence $s_i$ receives a scalar centrality score $\hat{y}_i = \sigma(W h_i + b)$, where $h_i$ is the [CLS]-token embedding of $s_i$.
- Selection and Non-Entailment Filtering: The top-$k$ sentences by score are iteratively filtered to exclude those entailed by earlier selections, i.e., $s_i$ is kept only if no previously selected $s_j$ satisfies $\mathrm{Entail}(s_j, s_i)$, producing a concise, non-redundant claim candidate set $\hat{C}$; a code sketch of this stage follows.
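The sketch below assumes centrality scores are supplied by any extractive summarizer (BertSum in the cited pipeline) and uses the public `roberta-large-mnli` checkpoint as a stand-in for the document-level DocNLI model; the top-$k$ value and entailment threshold are illustrative.

```python
# Select-then-filter sketch: rank sentences by centrality, then drop any sentence
# entailed by an earlier selection. roberta-large-mnli stands in for DocNLI.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_NAME = "roberta-large-mnli"  # stand-in for DocNLI (assumption)
tok = AutoTokenizer.from_pretrained(NLI_NAME)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_NAME).eval()

def entails(premise: str, hypothesis: str, threshold: float = 0.9) -> bool:
    """True if the premise entails the hypothesis with probability >= threshold."""
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[2].item() >= threshold  # index 2 = "entailment" for this checkpoint

def select_claim_candidates(sentences: list[str], scores: list[float], k: int = 5) -> list[str]:
    """Keep the top-k most central sentences, skipping those entailed by earlier picks."""
    ranked = [s for s, _ in sorted(zip(sentences, scores), key=lambda p: -p[1])][:k]
    selected: list[str] = []
    for sent in ranked:
        if not any(entails(prev, sent) for prev in selected):
            selected.append(sent)
    return selected
```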
b. Decontextualisation via Question-Answering and Generation
To produce stand-alone (decontextualized) claims, each candidate $c$ undergoes explicit context addition. This is operationalized by:
- Extracting potentially ambiguous units (entities, pronouns) from $c$.
- Generating a question for each unit.
- Answering each question with retrieval (BM25) over the source document and a QA model (UnifiedQA).
- Converting the resulting QA pairs into declarative context sentences $x_{\mathrm{ctx}}$ using a model such as BART fine-tuned on QA2D.
- Concatenating the extra context with the original claim and rewriting through a decontextualisation model, $c' = \mathrm{Decontext}([x_{\mathrm{ctx}}; c])$, where $c'$ is the resulting decontextualized claim (Deng et al., 5 Jun 2024); these steps are composed in the sketch below.
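Because each step corresponds to a separate pretrained module, the components in this sketch are passed in as callables; the parameter names are illustrative, not taken from the paper's implementation.

```python
from typing import Callable, Iterable

def decontextualize(
    claim: str,
    document: str,
    extract_units: Callable[[str], Iterable[str]],   # ambiguous entities/pronouns in the claim
    generate_question: Callable[[str, str], str],    # (claim, unit) -> clarifying question
    retrieve: Callable[[str, str], list[str]],       # BM25-style retrieval over the document
    answer: Callable[[str, list[str]], str],         # UnifiedQA-style question answering
    qa_to_declarative: Callable[[str, str], str],    # QA2D-style (question, answer) -> sentence
    rewrite: Callable[[str, str], str],              # (context, claim) -> stand-alone claim
) -> str:
    """Compose the QA-based decontextualisation steps into a single pass."""
    context_sentences = []
    for unit in extract_units(claim):
        question = generate_question(claim, unit)
        passages = retrieve(question, document)
        answer_text = answer(question, passages)
        context_sentences.append(qa_to_declarative(question, answer_text))
    # Concatenate the generated context with the claim and rewrite it so the
    # claim is interpretable outside the source document.
    return rewrite(" ".join(context_sentences), claim)
```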
3. Alignment and Evaluation Metrics
Evaluating document-level claim extraction requires comparing sets of extracted claims to human-annotated references, which is formalized as a set alignment problem (Makaiová et al., 18 Nov 2025). Given a system claim set $S$, a reference claim set $R$, and a similarity function $\mathrm{sim}(s, r)$, the optimal matching $M \subseteq S \times R$ (with each claim in at most one pair) maximizes $\sum_{(s, r) \in M} \mathrm{sim}(s, r)$.
Unmatched claims are penalized, enforcing constraints on both over- and under-generation.
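A minimal sketch of this alignment objective, using the Hungarian algorithm from SciPy; the similarity function is pluggable (edit similarity, BERTScore, or an LLM judge), and the fixed penalty for unmatched claims is a placeholder for whatever calibration the evaluation adopts.

```python
from typing import Callable

import numpy as np
from scipy.optimize import linear_sum_assignment

def alignment_score(
    system: list[str],
    reference: list[str],
    sim: Callable[[str, str], float],  # pairwise similarity in [0, 1]
    penalty: float = 0.5,              # cost per unmatched claim (illustrative value)
) -> float:
    """Maximum-weight one-to-one matching score, minus penalties for unmatched claims."""
    if not system or not reference:
        return -penalty * (len(system) + len(reference))
    # Negate similarities so the assignment solver's cost minimization maximizes similarity.
    cost = np.array([[-sim(s, r) for r in reference] for s in system])
    rows, cols = linear_sum_assignment(cost)
    matched = -cost[rows, cols].sum()
    # Claims left out of the one-to-one matching count as over- or under-generation.
    unmatched = (len(system) - len(rows)) + (len(reference) - len(cols))
    return matched - penalty * unmatched
```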
| Metric | Principle | Notable Limitations |
|---|---|---|
| Edit Similarity | Levenshtein-based | Fails on paraphrase, sensitive to code-mixing and inflectional variability |
| BERTScore | Embedding similarity | Handles paraphrase but insensitive to negation flips |
| LLM-as-Judge (logit-based) | LLM semantic scoring | Highest human correlation, but does not explicitly check atomicity, decontextualization, or check-worthiness |
LLM-based scores (prompting Phi-4/Qwen2.5 and aggregating logits) demonstrate highest alignment with human preferences and outperform surface metrics in morphologically rich and informal languages (Makaiová et al., 18 Nov 2025).
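A sketch of the logit-based judging scheme: prompt an instruction-tuned model to decide whether two claims state the same fact, then read off the probability mass placed on the "yes" versus "no" continuations. The prompt wording and the `Qwen/Qwen2.5-7B-Instruct` checkpoint are assumptions, not the exact setup of the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint; the paper prompts Phi-4/Qwen2.5
tok = AutoTokenizer.from_pretrained(JUDGE_NAME)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_NAME, torch_dtype=torch.bfloat16).eval()

def judge_claim_match(system_claim: str, reference_claim: str) -> float:
    """Return P('yes') among {'yes', 'no'} as a semantic-match score in [0, 1]."""
    prompt = (
        "Do the following two claims state the same fact? Answer yes or no.\n"
        f"Claim 1: {system_claim}\nClaim 2: {reference_claim}\nAnswer:"
    )
    input_ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = judge(input_ids).logits[0, -1]  # next-token distribution
    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    # Normalize the two answer logits into a single score.
    return torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0].item()
```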
4. Datasets and Domain-specific Challenges
Recent datasets designed for this task include (Makaiová et al., 18 Nov 2025):
- CGCD (Czech Gold Claims Dataset): 60 Czech news comments, 2.17 claims/document, single annotator.
- CSAD (Czech-Slovak Agreement Dataset): 120 comments (mixed Czech/Slovak), 3.12 claims/document, multiple annotators.
Domain challenges include informal register, high surface/morphological variability, frequent code-mixing, and subtle negation phenomena that degrade the utility of string or embedding-based evaluation. The AVeriTeC-DCE dataset (Deng et al., 5 Jun 2024) includes 1,231 document–claim pairs with a median of 17 tokens per claim, focusing on central sentences and their revised stand-alone forms.
5. Experimental Findings and Method Comparisons
State-of-the-art pipelines utilize LLMs (GPT-4o mini, GPT-4.1, Claude Sonnet 3.7) for step-wise claim identification and check-worthiness filtering. Key quantitative results are:
- Score Performance: GPT-4o mini achieves the highest BERTScore (60.2 on CGCD), corroborated by LLM-judge scores (45.3 with Phi-4); Claude Sonnet 3.7 is competitive.
- Claim Count Control: Informed (count-constrained) extraction boosts alignment scores (e.g., Sonnet 3.7 BERTScore improves from 52.8 to 83.4).
- Penalty Calibration: Over- and under-generation penalties strongly affect metric rankings; correct claim count estimation is critical.
- Human Alignment: LLM-as-judge metrics show best agreement with human assessment, especially in informed settings (Makaiová et al., 18 Nov 2025).
In decontextualisation-specific evaluations (Deng et al., 5 Jun 2024), QA-based rewriting improved evidence retrieval and human interpretability, especially for complex phenomena (coreference, scoping, bridging anaphora).
6. Limitations and Prospects for Advancement
- Metric Limitations: Current LLM-based evaluation captures semantic overlap but does not independently verify atomicity, decontextualization, or check-worthiness of output claims. Proposals include integrating explicit subchecks for these properties.
- Dataset Scale: Existing resources are small; broader validation is required for robustness.
- Penalty Terms: Tuning or learning penalty costs for spurious/missing claims is an open area.
- Decontextualisation Coverage: Only a minority of extracted claims are amenable to decontextualisation, limiting method completeness in some domains (Deng et al., 5 Jun 2024).
- Component Dependence: Heavy reliance on pretrained modules (summarization, NLI, QA, decontextualisation, LLMs) may constrain domain transferability and robustness.
Future directions include unified training strategies for joint claim extraction and decontextualisation, the integration of multimodal cues, and the development of evaluation pipelines that explicitly model claim properties beyond semantic similarity, potentially leveraging rich human-judgment data for calibration (Makaiová et al., 18 Nov 2025, Deng et al., 5 Jun 2024).