
Document-Level Fact-Checking Architecture

Updated 22 December 2025
  • Document-Level Fact-Checking Architecture is a system that validates natural-language claims by retrieving and analyzing evidence from large document corpora.
  • It implements a multi-stage pipeline of document retrieval, evidence extraction, stance detection, and claim validation, each stage guided by a precise mathematical objective.
  • The architecture integrates attention-based aggregation and source reliability to mitigate errors from paraphrasing, class imbalance, and contradictory evidence.

A document-level fact-checking architecture is a computational system that, given a natural-language claim and a large corpus of documents, automatically retrieves relevant information, extracts evidence, estimates the stance of evidence with respect to the claim, and finally produces a verdict on the claim’s factuality. Modern systems implement this as a multi-stage pipeline, with explicit modules for retrieval, evidence selection, stance classification, and aggregation, supported by precise mathematical objectives, modular designs, and rigorous evaluation protocols.

1. Pipeline Decomposition: Core Stages and Data Flow

A canonical document-level fact-checking pipeline decomposes into four principal sequential modules:

  1. Document Retrieval: Rank and select the top-N candidate documents from the background collection, given a query claim.
  2. Evidence Extraction: For each retrieved document, split the text into sentences and rank them to select the top-K candidate evidence sentences.
  3. Stance Detection: For each evidence sentence, classify the stance as supporting, refuting, or neutral with regard to the claim.
  4. Claim Validation: Aggregate the evidence-level stances (optionally augmented by document- or source-level features) to produce a global verdict (e.g., supported, refuted, or not enough information).

The standard data flow, as operationalized in (Hanselowski et al., 2019), is:

  • Input: claim $c$ (short text)
  • Retrieval: output top-N docs $D = \{d_1, \dots, d_N\}$
  • Extraction: for each $d_i$, rank split sentences to assemble evidence set $E = \{e_1, \dots, e_K\}$
  • Stance: for each $e_j \in E$, predict $\mathbf{p}_j$ over stance labels $\{\text{support}, \text{refute}, \text{no-stance}\}$
  • Aggregation: combine $\{\mathbf{p}_j\}$ (e.g., via pooling) and source features $q_i$ to produce the final claim label

Document-level systems in both multilingual (Martín et al., 2021) and domain-specific settings (Sharma et al., 24 Nov 2025) fit this abstract structure, with only the granularity and composition of the evidence module adjusted to address application needs.
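The four-stage data flow above can be sketched as a composition of functions. The following is a minimal, self-contained illustration in which each module is a stand-in (simple lexical-overlap heuristics) for the retrieval, ranking, and NLI models the pipeline actually uses; all function names and scoring rules here are hypothetical.

```python
# Hypothetical skeleton of the retrieval -> extraction -> stance -> validation
# pipeline. Each module is a toy lexical-overlap stand-in for a learned model.

def retrieve(claim, corpus, n=5):
    """Rank documents by word overlap with the claim; return the top-N."""
    c_words = set(claim.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(c_words & set(d.lower().split())))
    return ranked[:n]

def extract_evidence(claim, docs, k=5):
    """Split retrieved documents into sentences and keep the K most claim-like."""
    c_words = set(claim.lower().split())
    sents = [s.strip() for d in docs for s in d.split(".") if s.strip()]
    sents.sort(key=lambda s: -len(c_words & set(s.lower().split())))
    return sents[:k]

def stance(claim, sentence):
    """Toy stance model: distribution over (support, refute, no-stance)."""
    words = sentence.lower().split()
    overlap = len(set(claim.lower().split()) & set(words))
    if "not" in words and overlap > 2:
        return (0.1, 0.8, 0.1)
    return (0.7, 0.1, 0.2) if overlap > 2 else (0.1, 0.1, 0.8)

def validate(stances):
    """Mean-pool the stance distributions and take the argmax label."""
    labels = ("supported", "refuted", "not enough info")
    pooled = [sum(col) / len(stances) for col in zip(*stances)]
    return labels[pooled.index(max(pooled))]

def fact_check(claim, corpus):
    docs = retrieve(claim, corpus)
    evidence = extract_evidence(claim, docs)
    if not evidence:
        return "not enough info"
    return validate([stance(claim, e) for e in evidence])
```

In a real system each function would wrap a trained model (BM25/dense retriever, sentence ranker, NLI classifier, aggregator), but the interfaces and data flow match the abstract structure described above.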

2. Module Algorithms and Mathematical Objectives

The fact-checking pipeline integrates heterogeneous models at each stage, with explicit mathematical formulations guiding both training and inference:

  • Document Retrieval: Sparse retrieval uses TF–IDF or BM25, scoring via cosine or BM25:

$$\mathrm{score}(c,d) = \frac{\mathbf{v}_{\mathrm{TFIDF}}(c) \cdot \mathbf{v}_{\mathrm{TFIDF}}(d)}{\|\mathbf{v}_{\mathrm{TFIDF}}(c)\|_2 \, \|\mathbf{v}_{\mathrm{TFIDF}}(d)\|_2}$$

Neural retrievers map claim and document to dense vectors via BERT or dual-encoder and compute similarity with dot product (Hanselowski et al., 2019, Karisani et al., 2024).
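The sparse-retrieval scoring above can be made concrete with a small, dependency-free TF-IDF implementation. This is an illustrative sketch (smoothed IDF, dict-based sparse vectors), not the exact weighting scheme of any cited system:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build TF-IDF vectors (smoothed IDF) for a small corpus; illustrative only."""
    docs = [t.lower().split() for t in texts]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}  # +1 keeps common terms nonzero
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors, as in score(c, d)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_top_n(claim, documents, n=5):
    """Score every document against the claim and return the top-N."""
    vecs = tfidf_vectors([claim] + documents)
    c_vec, d_vecs = vecs[0], vecs[1:]
    ranked = sorted(zip(documents, d_vecs), key=lambda p: -cosine(c_vec, p[1]))
    return [doc for doc, _ in ranked[:n]]
```

A dense retriever would replace `tfidf_vectors` with encoder embeddings and `cosine` with a dot product, leaving `retrieve_top_n` unchanged.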

  • Evidence Extraction: Sentence ranking is approached either by lexical (TF–IDF) overlap or with neural models (BiLSTM, ESIM, Decomposable Attention). Ranking objectives often employ hinge loss:

$$L_{\text{hinge}} = \max(0, \gamma - f(c, s^+) + f(c, s^-))$$

Here, $s^+$ is a gold evidence sentence, $s^-$ a sampled negative, and $f$ the scoring function.
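The pairwise ranking objective is compact enough to state directly in code; a minimal sketch, with `margin` playing the role of $\gamma$:

```python
def hinge_loss(score_pos, score_neg, margin=1.0):
    """Pairwise ranking hinge loss: zero once the gold evidence sentence
    f(c, s+) outscores the negative f(c, s-) by at least the margin gamma."""
    return max(0.0, margin - score_pos + score_neg)
```

Training minimizes this over sampled (gold, negative) sentence pairs, pushing gold evidence above distractors in the ranking.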

  • Stance Detection: Models range from shallow MLPs with feature banks to transformer-based classifiers (e.g., DecompAtt, USE+Attention+MLP). All use cross-entropy loss:

$$L_{CE} = -\sum_{i=1}^{C} y_i \log p_i$$

where $C$ is the number of classes, $y$ the target distribution, and $p$ the softmax over class logits.
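As a concrete check of this objective, here is a minimal softmax plus cross-entropy in plain Python (illustrative only, not taken from any cited system):

```python
import math

def softmax(logits):
    """Numerically stable softmax over class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, probs, eps=1e-12):
    """L_CE = -sum_i y_i log p_i over the C stance classes."""
    return -sum(y * math.log(p + eps) for y, p in zip(target, probs))
```

For a one-hot target, the loss reduces to the negative log-probability assigned to the correct stance class.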

  • Claim Validation: Evidence stance vectors $P$ are aggregated (mean, max, or attention pooling) as the input to an MLP or transformer encoder, again trained by cross-entropy (Hanselowski et al., 2019).

Advanced systems (e.g., MetaSumPerceiver (Chen et al., 2024)) fuse multimodal evidence via Perceiver IO, cross-attending between claim text, document chunks, and image embeddings, and optimizing both cross-entropy and RL-based entailment-focused rewards.

3. Evidence Aggregation and Final Decision Strategies

Evidence aggregation is a distinguishing aspect of document-level fact-checking, requiring both modeling and system-level heuristics:

  • Pooling Mechanisms: Evidence-level stance probabilities are pooled (e.g., concatenated, mean, max, or learned attention mechanisms), forming an evidence feature vector $P$ for downstream claim classification (Hanselowski et al., 2019).
  • Source Reliability Integration: Aggregators can append or weight evidence by document-level features such as source reliability scores (e.g., domain-level trust, as in (Nadeem et al., 2019, Baly et al., 2018)):

$$F(\{s_i, w_i\}_{i=1}^{n}) = \frac{\sum_i w_i s_i}{\sum_i w_i}$$

Here, $s_i$ is a per-document stance scalar and $w_i$ is the source reliability weight.

  • Global Label Decision: Final labels are assigned using majority, maximum-probability, or logistic regression on the pooled evidence representation and source features. “Not enough information” can be triggered by a confidence threshold on evidence sufficiency (Hanselowski et al., 2019, Nadeem et al., 2019).

In some designs, global aggregation is handled by re-feeding concatenated or pooled evidence into a sentence-pair NLI model to predict the claim’s global label (Miranda et al., 2019).
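The reliability-weighted aggregator $F$ and a thresholded label decision can be sketched as follows. The stance encoding (+1 support, -1 refute) and the sufficiency threshold `tau` are hypothetical choices for illustration:

```python
def aggregate_weighted(stance_scores, reliability_weights):
    """F({s_i, w_i}) = sum_i w_i * s_i / sum_i w_i: reliability-weighted mean
    of per-document stance scalars (e.g., +1 support, -1 refute)."""
    num = sum(w * s for s, w in zip(stance_scores, reliability_weights))
    den = sum(reliability_weights)
    return num / den if den else 0.0

def verdict(agg_score, tau=0.25):
    """Map the aggregate to a global label; tau is a hypothetical confidence
    threshold below which evidence is deemed insufficient."""
    if agg_score > tau:
        return "supported"
    if agg_score < -tau:
        return "refuted"
    return "not enough information"
```

A learned aggregator would replace the fixed weights with attention scores and the threshold with a trained classifier, but the decision structure is the same.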

4. Evaluation Metrics and Data Partitioning

Module and end-to-end pipeline performance is assessed via rigorously defined metrics:

| Stage | Metric | Typical Values/Thresholds |
| --- | --- | --- |
| Retrieval | Recall@N, Precision@N | N = 5, τ_doc = 0.2 (cosine) |
| Evidence Extraction | Precision@K, Recall@K | K = 5 |
| Stance/Validation | Macro-F1, Accuracy | F1_macro, per-class |
| Document-level | Macro-F1, Accuracy, DocScore = $1 - \frac{\#\text{False}}{\#\text{Total Claims}}$ | Aggregate over claims |
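The Precision@K and Recall@K metrics above can be computed directly from a ranked output list and a gold evidence set; a minimal sketch:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-K retrieved items that are gold evidence."""
    top = retrieved[:k]
    return sum(1 for x in top if x in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of the gold evidence recovered within the top-K."""
    top = retrieved[:k]
    return sum(1 for x in relevant if x in top) / len(relevant) if relevant else 0.0
```

Recall@N for document retrieval is the same computation applied to ranked documents instead of sentences.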

Performance on diverse, multi-domain test sets exposes error modes specific to document-level regimes not observed in sentence- or claim-only settings (Hanselowski et al., 2019).

5. Sources of Error and Robustness Enhancements

Empirical analysis reveals several persistent sources of error propagation (Hanselowski et al., 2019):

  • Lexical Mismatch: Paraphrasing between claims and evidence yields low lexical overlap, severely impacting sparse retrieval and extraction recall.
  • Class Imbalance: Under-representation of refute-class examples biases models toward support/no-stance.
  • Contradictory Evidence: Presence of both support and refute evidence within top-K sentences increases claim-level confusion.
  • Noise from Unreliable Domains: Heterogeneous domains and unreliable sources introduce misleading evidence.

To mitigate these, the blueprint prescribes several interventional strategies:

  • End-to-end or joint training across pipeline modules to allow error corrections to propagate backwards, reducing cascading failure.
  • Auxiliary semantic matching modules (e.g., sentence transformers) to enhance recall of paraphrased or semantically-aligned evidence.
  • Class-calibrated thresholds and cost-sensitive or focal losses to address bias toward dominant classes.
  • Learned attention-based aggregation to downweight contradictory or low-quality evidence.
  • Integration of document-level reliability or bias features.
  • Multi-task or shared-encoder learning to enhance generalization and exploit cross-task regularities.
  • Synthetic paraphrase/data augmentation and adversarial noise insertion for robustness to domain and phrasing variability.
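The focal-loss intervention against class imbalance can be sketched in a few lines. The formulation below is the standard focal loss; the `gamma` and `alpha` values are hypothetical defaults, not tuned settings from any cited system:

```python
import math

def focal_loss(prob_true, gamma=2.0, alpha=1.0):
    """Focal loss FL = -alpha * (1 - p_t)^gamma * log(p_t): down-weights easy
    (high-confidence) examples so errors on rare classes, such as refute,
    dominate the gradient. gamma=0 recovers plain cross-entropy."""
    return -alpha * (1.0 - prob_true) ** gamma * math.log(prob_true)
```

Here `prob_true` is the model's probability for the correct class; confidently classified support/no-stance examples contribute almost nothing, shifting training effort toward the under-represented refute class.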

6. Applicability, Generalization, and Future Directions

The pipeline as outlined in (Hanselowski et al., 2019) demonstrates substantial expressivity and modularity:

  • Applicability: The blueprint extends readily to both classic fact-checking benchmarks (e.g., FEVER), mixed domain news corpora, and specialized domains given appropriate document retrievers and stance models.
  • Generalization: Error analysis highlights the need for robust paraphrase understanding, imbalance mitigation, and explicit modeling of contradictory evidence, all of which motivate current directions such as multi-hop reasoning (see GraphCheck (Chen et al., 23 Feb 2025)) and multimodal fusion (see MetaSumPerceiver (Chen et al., 2024)).
  • Scalability: End-to-end training, parameter-efficient fine-tuning (e.g., LoRA adapters (Sharma et al., 24 Nov 2025)), and compact neural models (MiniCheck (Tang et al., 2024)) enable deployment at research and production scales.

A plausible implication is that future document-level fact-checking systems will increasingly integrate dense cross-encoder retrieval, graph-based evidence aggregation, hierarchical claim decomposition, dynamic multimodal fusion, and meta-learning for flexible domain adaptation.


References:

  • "A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking" (Hanselowski et al., 2019)
  • "MetaSumPerceiver: Multimodal Multi-Document Evidence Summarization for Fact-Checking" (Chen et al., 2024)
  • "Automated Fact Checking in the News Room" (Miranda et al., 2019)
  • "FAKTA: An Automatic End-to-End Fact Checking System" (Nadeem et al., 2019)
  • "FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-Checking" (Sharma et al., 24 Nov 2025)
  • "FacTeR-Check: Semi-automated fact-checking through Semantic Similarity and Natural Language Inference" (Martín et al., 2021)
  • "OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs" (Wang et al., 2024)
  • "Document-level Claim Extraction and Decontextualisation for Fact-Checking" (Deng et al., 2024)
  • "Integrating Stance Detection and Fact Checking in a Unified Corpus" (Baly et al., 2018)
  • "GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking" (Chen et al., 23 Feb 2025)
  • "Fact Checking Beyond Training Set" (Karisani et al., 2024)
  • "MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents" (Tang et al., 2024)
