AFEV: Iterative Extraction & Verification
- AFEV is a fact-checking paradigm that decomposes complex claims into atomic factual units and verifies them through structured, iterative processing.
- It employs dense retrieval and cross-encoder reranking to select relevant evidence and dynamic in-context demonstrations for precise verification.
- The approach mitigates error propagation through closed-loop iterative refinement and transparent aggregation of subclaim results.
Iterative Extraction and Verification (AFEV) is a modular fact-checking paradigm in which a complex claim is decomposed into atomic factual units that are verified through fine-grained evidence retrieval, adaptive in-context reasoning, and dynamic aggregation of intermediate results. AFEV is designed to mitigate reasoning failures, control error propagation, and enhance interpretability when verifying multi-hop, compositional, and adversarial claims.
1. Formal Structure and Notation
Let $C$ denote a complex natural-language claim, $D$ an external textual corpus, and $\mathbb{C}$ a database of labeled demonstration claims. The AFEV pipeline is defined by the following iterative sequence:
- Decomposition: $C \rightarrow \{F_1, F_2, \ldots, F_T\}$, where each $F_t$ is an atomic fact extracted from $C$ via auto-regressive or conditional generation.
- Evidence Retrieval: For each $F_t$,
  - Retrieve a candidate set $E'_t$ using dense dual-encoder cosine similarity (with shared encoder $\phi$): $E'_t = \operatorname{top-}k_{\,e \in D}\, \cos\big(\phi(F_t), \phi(e)\big)$.
  - Rerank $E'_t$ to select $E_t$ using a cross-encoder reranker trained with the InfoNCE loss (a training-loss sketch is given at the end of this section):

  $$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\big(s(F_t, e^{+})\big)}{\exp\big(s(F_t, e^{+})\big) + \sum_{e^{-}} \exp\big(s(F_t, e^{-})\big)},$$

  where $e^{+}$ is the true evidence, $e^{-}$ ranges over negatives, and $s(\cdot,\cdot)$ is the cross-encoder relevance score.
- Dynamic Demonstration Selection: For each $F_t$, select context-specific demonstrations $A_t$ from the database of labeled claims $\mathbb{C}$, maximizing semantic similarity with $F_t$.
- Reasoning: For each $F_t$, aggregate its retrieved evidence $E_t$ and demonstrations $A_t$ to produce a fact-level label $y_t$ and rationale $r_t$: $(y_t, r_t) = \operatorname{Reasoner}(F_t, C, E_t, A_t)$.
- Iterative Refinement: Fact extraction at step $t{+}1$ is conditioned on all previous outputs, $F_{t+1} = \operatorname{Extractor}\big(C, \{(F_i, y_i, r_i)\}_{i \le t}\big)$, until a coverage-based STOP criterion is met.
- Aggregation and Final Decision: The set of fact-level labels $\{y_t\}$ and rationales $\{r_t\}$ are composed to form the overall verdict $y^{\star} = \operatorname{Aggregate}\big(\{F_t\}, \{y_t\}, \{r_t\}\big)$.
This design facilitates both interpretability and dynamic correction by tightly coupling decomposition, retrieval, and verification.
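The reranker objective above is the standard InfoNCE contrastive loss. Below is a minimal PyTorch sketch of that loss, assuming the cross-encoder has already produced scalar scores for one positive and $N$ negative passages per fact; the batch layout and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def infonce_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE over one positive and N negatives per fact.

    pos_score:  shape (B,)   -- cross-encoder logits s(F_t, e+)
    neg_scores: shape (B, N) -- cross-encoder logits s(F_t, e-) for N negatives
    """
    # Place the positive at index 0 of each row of logits.
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (B, N+1)
    targets = torch.zeros(logits.size(0), dtype=torch.long)          # positive index
    return F.cross_entropy(logits, targets)

# Toy usage: batch of 2 facts, 3 negatives each.
loss = infonce_loss(torch.randn(2), torch.randn(2, 3))
```

Cross-entropy against a fixed target of index 0 is exactly $-\log$ of the softmax mass assigned to the positive, which is the InfoNCE objective written above.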
2. Iterative Algorithmic Workflow
The full AFEV protocol is implemented as a closed-loop iterative process:
```python
def AFEV_FactVerification(C, D, ℂ):
    # C: complex claim, D: evidence corpus, ℂ: database of labeled demonstration claims
    F_list, Y_list, R_list = [], [], []
    while True:
        # Extract the next atomic fact, conditioned on the full verification history.
        F_t = Extractor(C, F_list, Y_list, R_list)
        if F_t == "STOP":  # coverage-based stopping criterion
            break
        F_list.append(F_t)
        # Two-stage evidence selection: dense retrieval, then cross-encoder reranking.
        E_t_prime = retrieve_top_k(D, F_t)
        E_t = Reranker(E_t_prime, F_t)
        # Dynamic in-context demonstrations for this subclaim.
        A_t = select_demonstrations(ℂ, F_t)
        # Fact-level verdict y_t and natural-language rationale r_t.
        y_t, r_t = Reasoner(F_t, C, E_t, A_t)
        Y_list.append(y_t)
        R_list.append(r_t)
    # Compose subclaim labels and rationales into the final verdict.
    y_star = Aggregate(F_list, Y_list, R_list)
    return y_star, Y_list, R_list
```
- Each extraction step exploits prior subclaim-verification pairs, closing the error-propagation loop.
- The iterative approach ends adaptively once the atomic fact set collectively covers the full semantic content of $C$; a toy instantiation of the loop is sketched below.
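To make the control flow concrete, the toy instantiation below plugs trivial stand-in components into `AFEV_FactVerification`. Every stub (string-split extraction, word-overlap retrieval, substring reasoning) is a deliberately naive placeholder for what would be an LLM or trained model in AFEV, and none of the heuristics are from the paper.

```python
# Illustrative stubs only -- each would be an LLM call or trained model in AFEV.
def Extractor(C, facts, labels, rationales):
    # Emit one pre-split subclaim per turn, then signal coverage.
    pending = [f for f in C.split(". ") if f not in facts]
    return pending[0] if pending else "STOP"

def retrieve_top_k(D, fact, k=5):
    # Rank corpus passages by crude word overlap with the subclaim.
    return sorted(D, key=lambda e: len(set(fact.split()) & set(e.split())),
                  reverse=True)[:k]

def Reranker(candidates, fact):
    return candidates[:2]  # keep the top reranked evidence

def select_demonstrations(demo_db, fact):
    return demo_db[:2]

def Reasoner(fact, C, evidence, demos):
    label = "SUPPORTED" if any(fact in e for e in evidence) else "REFUTED"
    return label, f"Checked '{fact}' against {len(evidence)} passages."

def Aggregate(facts, labels, rationales):
    return "TRUE" if all(y == "SUPPORTED" for y in labels) else "FALSE"

corpus = ["Paris is the capital of France", "The Seine flows through Paris"]
verdict, labels, rationales = AFEV_FactVerification(
    "Paris is the capital of France. The Seine flows through Paris", corpus, [])
print(verdict)  # "TRUE"
```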
3. Dynamic Refinement and Error Control
AFEV addresses error accumulation through:
- Closed-loop Extraction: Each new atomic fact $F_{t+1}$ is generated with explicit conditioning on the entire verification history $\{(F_i, y_i, r_i)\}_{i \le t}$. This dynamically corrects prior faulty decompositions and prevents irrecoverable branching errors.
- STOP Criterion: Extraction halts when all semantic units of the original claim are accounted for, preventing both under- and over-decomposition (one plausible coverage check is sketched at the end of this section).
- Supervised Evidence Filtering: A learned reranker suppresses low-quality or semantically irrelevant evidence early in the pipeline, minimizing contamination at the reasoning phase.
These mechanisms jointly prevent the uncontrolled noise propagation widely observed in one-shot decomposition strategies.
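The paper describes the STOP test only as coverage-based. One plausible realization, sketched here as an assumption rather than the authors' method, halts extraction once the concatenated atomic facts are semantically close to the full claim under a sentence encoder; the checkpoint name and threshold `tau` are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def should_stop(claim: str, facts: list[str], tau: float = 0.9) -> bool:
    """Halt when the accumulated atomic facts cover the claim's semantics."""
    if not facts:
        return False
    claim_vec = encoder.encode(claim, convert_to_tensor=True)
    facts_vec = encoder.encode(" ".join(facts), convert_to_tensor=True)
    return util.cos_sim(claim_vec, facts_vec).item() >= tau
```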
4. Fine-Grained Retrieval and In-Context Demonstrations
Retrieval is performed in a two-stage manner:
- Dense Retrieval: A bi-encoder computes vector representations and ranks all corpus candidates; the retrieval depth $k$ is typically set to $5$.
- Cross-Encoder Reranking: The top-$k$ passages undergo pairwise reranking. The best evidence sentences (usually $1$–$2$) are retained for the reasoning module.
- Dynamic Demonstrations: For each $F_t$, the previously validated claims with the highest semantic similarity serve as in-context examples (typically $1$ or $2$). This on-the-fly instance retrieval aligns LLM behavior with the specifics of the current subclaim.
The above provides both coverage and context specificity for nuanced multi-hop verification.
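A compact sketch of the two-stage retrieval and dynamic demonstration selection using the sentence-transformers library; the specific checkpoints (`all-MiniLM-L6-v2`, `cross-encoder/ms-marco-MiniLM-L-6-v2`) and the cut-offs $k=5$, $m=2$, $n=2$ follow the typical values quoted above but are otherwise illustrative.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # illustrative
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative

def retrieve_and_rerank(fact: str, corpus: list[str], k: int = 5, m: int = 2) -> list[str]:
    """Stage 1: dense top-k by cosine similarity. Stage 2: cross-encoder rerank to m."""
    corpus_vecs = bi_encoder.encode(corpus, convert_to_tensor=True)
    fact_vec = bi_encoder.encode(fact, convert_to_tensor=True)
    hits = util.semantic_search(fact_vec, corpus_vecs, top_k=k)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]
    scores = cross_encoder.predict([(fact, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:m]]

def select_demonstrations(fact: str, demo_db: list[tuple[str, str]], n: int = 2):
    """Pick the n labeled (claim, label) pairs most similar to the current subclaim."""
    claims = [c for c, _ in demo_db]
    sims = util.cos_sim(bi_encoder.encode(fact, convert_to_tensor=True),
                        bi_encoder.encode(claims, convert_to_tensor=True))[0]
    top = sims.argsort(descending=True)[:n]
    return [demo_db[i] for i in top.tolist()]
```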
5. Benchmark Experiments and Empirical Results
Evaluation across the LIAR-PLUS, HOVER, PolitiHop, RAWFC, and LIAR benchmarks demonstrates consistent gains (LA = label accuracy; RAWFC and LIAR are reported with F1 only):
| Dataset | Baseline LA / F1 | AFEV LA / F1 |
|---|---|---|
| LIAR-PLUS | 82.10 / 80.78 | 83.73 / 83.12 |
| HOVER | 76.98 / 76.89 | 78.87 / 78.76 |
| PolitiHop | 72.34 / 55.80 | 74.14 / 57.69 |
| RAWFC (F1) | 57.3 | 60.2 |
| LIAR (F1) | 42.0 | 43.9 |
Ablation studies confirm that atomic fact extraction (+1.1 to +1.9 F1), iterative extraction (+0.5 to +1.1 F1), rationale generation, reranking, and dynamic demonstrations each yield measurable improvements. Optimal performance is achieved with $1$–$2$ reranked evidence sentences and $1$–$2$ demonstrations per fact-level query. Efficiency is preserved: the closed-loop variant incurs only a modest runtime overhead relative to one-shot baselines.
6. Interpretability and Case Analysis
AFEV natively produces a fine-grained audit trail:
- Each subclaim is paired with the retrieved evidence, in-context demonstration, rationale, and atomic label.
- Intermediate errors or ambiguities can be directly traced to precise subcomponents, facilitating targeted correction.
Case studies illustrate complex sports-statistics claims disassembled into player-year facts, with independent count-verification and explicit cross-references, ultimately yielding a transparent, human-readable decision sequence.
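One natural way to materialize this audit trail is a per-subclaim record; the dataclass below is a minimal sketch (field names are illustrative, not from the paper).

```python
from dataclasses import dataclass

@dataclass
class SubclaimRecord:
    """One row of the AFEV audit trail (field names illustrative)."""
    fact: str                  # atomic subclaim F_t
    evidence: list[str]        # reranked evidence E_t
    demonstrations: list[str]  # in-context examples A_t
    label: str                 # fact-level verdict y_t
    rationale: str             # natural-language justification r_t

def render_trail(records: list[SubclaimRecord]) -> str:
    """Human-readable decision sequence, one block per subclaim."""
    lines = []
    for i, rec in enumerate(records, 1):
        lines.append(f"[{i}] {rec.fact} -> {rec.label}")
        lines.append(f"    rationale: {rec.rationale}")
        for e in rec.evidence:
            lines.append(f"    evidence: {e}")
    return "\n".join(lines)
```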
7. Significance, Limitations, and Future Directions
AFEV demonstrates that iterative atomic factorization and adaptive reasoning outperform static or monolithic pipelines, particularly for multi-hop, ambiguous, or adversarial claims (Zheng et al., 9 Jun 2025). Its explicit, chained reasoning and evidence tracking address key issues in factual verification: brittle error propagation, noisy retrieval, and interpretability bottlenecks.
Limitations include increased dependence on retrieval quality, potential slowdowns for claims requiring many atomic units, and sensitivity to noise in the demonstration database. Future work proposed in (Zheng et al., 9 Jun 2025) includes tighter coupling with retrieval-learning objectives, extension to deeper multi-hop reasoning with richer aggregation, and application of the retrieve–edit–aggregate loop to heterogeneous domains (e.g., code, scientific fact-checking). Empirical evidence from recent multi-hop benchmarks supports continued development in this direction.