Contrastive Fact-Checking Reranker (CFR)
- The paper introduces CFR, which improves evidence retrieval for complex claims through contrastive learning and multi-source supervision.
- CFR re-ranks BM25 candidates using a fine-tuned dual-encoder architecture with InfoNCE loss and in-batch negatives for robust selection.
- Empirical results show CFR boosts retrieval metrics and veracity accuracy across diverse datasets in both in-domain and transfer settings.
The Contrastive Fact-Checking Reranker (CFR) is a retrieval architecture for fact-checking pipelines that enhances evidence selection for claims requiring nuanced, multi-hop inference. CFR is trained on the AVeriTeC dataset using contrastive learning with supervised, distilled, and answer-based signals, improving retrieval quality and downstream veracity judgments in real-world and transfer settings (Sriram et al., 7 Oct 2024).
1. System Architecture and Workflow
CFR builds on the Contriever dual-encoder framework (BERT-base, uncased, 110M parameters). Input processing concatenates each claim $c$ with one of its annotated subquestions $q$, inserting a separator token, to produce the query representation $E(c \,[\mathrm{SEP}]\, q)$. Each document candidate $d$ (comprising a 200-token window plus the article title) is encoded as $E(d)$. Scoring utilizes cosine similarity:

$$s(c, q, d) = \cos\big(E(c \,[\mathrm{SEP}]\, q),\ E(d)\big)$$
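A minimal sketch of this scoring step, assuming a Hugging Face Contriever checkpoint (`facebook/contriever`) with mean-pooled token embeddings; the pooling and `[SEP]` conventions shown here are assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Assumed base checkpoint; the paper fine-tunes a Contriever (BERT-base) dual encoder.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    """Encode texts and mean-pool token embeddings over non-padding positions."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                      return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

# Query = claim [SEP] subquestion; document = title + 200-token window.
claim = "The senator voted against the infrastructure bill."
subq = "How did the senator vote on the infrastructure bill?"
query = f"{claim} [SEP] {subq}"
doc = "Senate roll call [SEP] The senator was recorded as voting no on ..."

q, d = embed([query]), embed([doc])
score = F.cosine_similarity(q, d)                          # s(c, q, d)
```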
The pipeline operates in two stages:
- Initial retrieval of the top-$k$ documents using BM25 over the web results returned by Bing.
- CFR reranks the candidates via the fine-tuned Contriever encoder.
The final top-$k$ documents (up to $10$) are supplied to a reader module (GPT-4 or a similar LM) for evidence extraction and the claim veracity decision. This multi-stage pipeline addresses both the direct and indirect evidence relationships critical for complex claims.
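A compact sketch of the two-stage retrieve-and-rerank flow, reusing the `embed()` helper above; `rank_bm25` serves as a stand-in lexical retriever, and the cutoffs (`k_bm25=100`, `k_final=10`) are illustrative assumptions rather than values from the paper:

```python
import numpy as np
import torch.nn.functional as F
from rank_bm25 import BM25Okapi

def retrieve_and_rerank(query, corpus, k_bm25=100, k_final=10):
    """Stage 1: BM25 candidate retrieval; Stage 2: CFR dense reranking."""
    # Stage 1: lexical retrieval over the web-sourced candidate documents.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    bm25_scores = bm25.get_scores(query.split())
    candidates = [corpus[i] for i in np.argsort(bm25_scores)[::-1][:k_bm25]]

    # Stage 2: rerank the BM25 candidates with the fine-tuned dual encoder.
    q, d = embed([query]), embed(candidates)
    sims = F.cosine_similarity(q, d)                       # cosine reranking scores
    order = sims.argsort(descending=True)[:k_final].tolist()
    return [candidates[i] for i in order]                  # passed to the reader LM
```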
2. Contrastive Training Objectives
CFR applies the InfoNCE contrastive loss to optimize its evidence selection. For a given query $q$, each training tuple includes the query embedding $E(q)$, a set of positives $\mathcal{P}$, and a set of negatives $\mathcal{N}$:

$$\mathcal{L} = -\log \frac{\sum_{d^{+} \in \mathcal{P}} \exp\!\big(s(q, d^{+})/\tau\big)}{\sum_{d^{+} \in \mathcal{P}} \exp\!\big(s(q, d^{+})/\tau\big) + \sum_{d^{-} \in \mathcal{N}} \exp\!\big(s(q, d^{-})/\tau\big)}$$

where the temperature $\tau$ is set to 0.05. Multiple positive documents contribute additive terms to the numerator, whereas all negatives enter only the denominator. In-batch negatives are leveraged to further improve training efficiency and the discrimination of non-relevant evidence.
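A minimal PyTorch sketch of this multi-positive InfoNCE loss, matching the formula above and assuming already-encoded embeddings as inputs:

```python
import torch
import torch.nn.functional as F

def info_nce_multi_positive(q, positives, negatives, tau=0.05):
    """Multi-positive InfoNCE: positives sum in the numerator, while all
    negatives (explicit and in-batch) join only the denominator.
    Shapes: q: (H,), positives: (P, H), negatives: (N, H)."""
    pos = torch.exp(F.cosine_similarity(q.unsqueeze(0), positives) / tau)  # (P,)
    neg = torch.exp(F.cosine_similarity(q.unsqueeze(0), negatives) / tau)  # (N,)
    return -torch.log(pos.sum() / (pos.sum() + neg.sum()))
```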
3. Supervision Signal Design
Three complementary sources of supervision construct positive and negative training sets:
- Gold-label supervision (AVeriTeC evidence): Positives comprise paragraph windows extracted from human-annotated gold evidence articles. Negatives are BM25 hits not contained in the annotated gold evidence.
- GPT-4 Distillation (“relevance”): BM25 candidates are classified by zero-shot GPT-4 prompts ("Does this passage contain relevant evidence to answer subquestion q? Yes/No"). Responses populate the positive pool (for “Yes”) and the negative pool (for “No”); a labeling sketch appears below.
- Subquestion answer evaluation (LERC-based): For the top-15 BM25 candidates, GPT-4 generates a candidate answer to the subquestion, which is scored against the gold answer using the learned LERC metric; high LERC scores mark high-quality positives, while low LERC scores mark strong negatives.
The final training pools are the unions of these signals:

$$\mathcal{P} = \mathcal{P}_{\mathrm{sup}} \cup \mathcal{P}_{\mathrm{LERC}}, \qquad \mathcal{N} = \mathcal{N}_{\mathrm{sup}} \cup \mathcal{N}_{\mathrm{LERC}}$$

where $\mathcal{P}_{\mathrm{sup}}$ and $\mathcal{N}_{\mathrm{sup}}$ aggregate the gold-evidence and distillation signals.
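A sketch of the GPT-4 distillation signal referenced above; `llm` is a hypothetical callable wrapping a chat-completion API, and the prompt wording follows the zero-shot relevance question quoted earlier:

```python
def distill_relevance_labels(subquestion, candidates, llm):
    """Label each BM25 candidate passage as a distilled positive ('Yes')
    or negative ('No') for the given subquestion."""
    positives, negatives = [], []
    for passage in candidates:
        prompt = (f'Does this passage contain relevant evidence to answer '
                  f'subquestion "{subquestion}"? Yes/No\n\n{passage}')
        label = llm(prompt).strip().lower()                # hypothetical LLM call
        (positives if label.startswith("yes") else negatives).append(passage)
    return positives, negatives
```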
4. Data Construction, Input Representation, and Sampling
Claims and subquestions are concatenated up to 128 tokens. Documents are chunked into 200-token spans with a stride of 100 tokens, each preceded by a title. For each claim-subquestion pair, a candidate pool consists of approximately 500 documents sourced from BM25 (web + gold) and synthetic negatives. Training tuples are generated for all pairs of positives and negatives. Shuffling and batching (e.g., batch size of 32) introduce in-batch negatives. Hard negatives are selected from BM25 candidates labeled "No" by GPT-4, those with low LERC scores, or synthetic negatives for one-hop reasoning.
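A sketch of the document chunking described above (200-token windows, stride 100, title prefixed); the `[SEP]`-style title separator is an assumption:

```python
def chunk_document(title, tokens, window=200, stride=100):
    """Split a tokenized document into overlapping 200-token windows
    (stride 100), each prefixed with the article title."""
    chunks, start = [], 0
    while True:
        span = tokens[start:start + window]
        chunks.append(f"{title} [SEP] {' '.join(span)}")
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```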
5. Training Procedure and Hyperparameters
CFR uses AdamW with a grid-searched learning rate, a batch size of 32, and 12 training epochs (≈3 GPU-hours on dual RTX 8000 GPUs). Loss computation relies on the InfoNCE framework with the described multi-source supervision and in-batch negatives. Training interleaves tuples from the three supervision signals.
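A training-loop skeleton under stated assumptions: `train_loader` is a hypothetical loader yielding one (query, positive, negative) text triple per example, the learning rate below is a placeholder for the grid-searched value, and in-batch negatives are formed from the other examples' positives; it reuses `encoder`, `embed()`, and `info_nce_multi_positive()` from the earlier sketches:

```python
import torch

# Placeholder learning rate; the paper selects the value by grid search.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

for epoch in range(12):                       # 12 epochs, batch size 32
    for batch in train_loader:                # interleaves the three supervision signals
        q = embed(batch["query"])             # claim [SEP] subquestion
        pos = embed(batch["positive"])        # one positive passage per example
        neg = embed(batch["negative"])        # one hard negative per example
        losses = []
        for i in range(q.size(0)):
            # In-batch negatives: other examples' positives act as extra negatives.
            in_batch = torch.cat([neg[i:i + 1], pos[torch.arange(len(pos)) != i]])
            losses.append(info_nce_multi_positive(q[i], pos[i:i + 1], in_batch))
        loss = torch.stack(losses).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```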
6. Evaluation Strategies and Metrics
Evaluation employs five datasets (n=200 held-out examples each): AVeriTeC, ClaimDecomp, FEVER, HotpotQA, and a synthetic reasoning dataset. Retrieval performance is measured using:
- LERC score: Average LERC between the gold answer and the answer predicted from the top-1 retrieved document.
- Top-doc relevance: Fraction of top-1 documents GPT-4 judges as “Yes” relevant.
- Gold@10: Proportion of instances with gold evidence in the top-10 retrieved.
End-to-end effectiveness is assessed by veracity accuracy: fraction of claims correctly labeled (Support/Refute/NEI) by the reader/verifier.
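For reference, a minimal sketch of the two rank-based measures (Gold@10 here, plus the MRR used in the synthetic one-hop experiments reported below); the LERC and GPT-4 relevance judgments require external models and are omitted:

```python
def gold_at_k(ranked_docs, gold_docs, k=10):
    """1.0 if any gold evidence document appears in the top-k, else 0.0;
    averaged over claims to give Gold@10."""
    return float(any(doc in gold_docs for doc in ranked_docs[:k]))

def mean_reciprocal_rank(ranked_lists, gold_sets):
    """MRR over a set of queries (reported for the synthetic one-hop setting)."""
    reciprocal_ranks = []
    for ranked, gold in zip(ranked_lists, gold_sets):
        rank = next((i + 1 for i, doc in enumerate(ranked) if doc in gold), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```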
7. Empirical Results and Ablation Analyses
In-domain (AVeriTeC): CFR yields consistent improvements over baseline Contriever:
- LERC: 0.53 (+5 points)
- Top-doc relevance: 0.62 (+8 points)
- Gold@10: 0.59 (+9 points)
- Veracity Accuracy: 0.60 (+6 points)
Transfer and Synthetic Experiments:
- ClaimDecomp: Top-doc relevance from 0.32 → 0.37, veracity 0.32 → 0.34.
- FEVER: Top-doc relevance 0.49 → 0.57; veracity 0.58 → 0.63.
- HotpotQA: LERC 0.33 → 0.36; top-doc relevance 0.27 → 0.32.
- Synthetic one-hop reasoning: Mean reciprocal rank (MRR) 0.68 (Contriever) → 0.79 (CFR).
Ablation studies reveal the interplay of signals:
- “Distill” only: Highest top-doc relevance (0.63), moderate veracity (0.55).
- “LERC” only: Highest veracity (0.60), lower top-doc relevance (0.56).
- Combination (“distill+LERC”): Best overall upstream and downstream accuracy.
Qualitative analysis shows improved evidence selection for nuanced subquestions: for example, CFR retrieves contextually relevant passages (e.g., an Amtrak CEO's comment on funding) and surfaces evidence that directly addresses subquestions about vaccine development details.
8. Extensions and Observations
The CFR multi-signal contrastive framework is compatible with arbitrary black-box teachers (LLMs) or answer-equivalence metrics, and LERC-derived signals enhance retrieval for non-factoid, long-answer subquestions. Synthetic negatives further improve the model’s discrimination for one-hop or indirect evidence. Domain-specific decomposition of subquestions or additional expert supervision may increase reranking quality. A plausible implication is that future systems can extend CFR’s paradigm to new domains with flexible incorporation of external annotation and reasoning strategies for robust fact-checking retrieval (Sriram et al., 7 Oct 2024).