
Contrastive Fact-Checking Reranker (CFR)

Updated 5 December 2025
  • The paper introduces CFR, which improves evidence retrieval for complex claims through contrastive learning and multi-source supervision.
  • CFR re-ranks BM25 candidates using a fine-tuned dual-encoder architecture with InfoNCE loss and in-batch negatives for robust selection.
  • Empirical results show CFR boosts retrieval metrics and veracity accuracy across diverse datasets in both in-domain and transfer settings.

The Contrastive Fact-Checking Reranker (CFR) is a retrieval architecture for fact-checking pipelines that enhances evidence selection for claims requiring nuanced, multi-hop inference. CFR is trained on the AVeriTeC dataset using contrastive learning with supervised, distilled, and answer-based signals, improving retrieval quality and downstream veracity judgments in real-world and transfer settings (Sriram et al., 7 Oct 2024).

1. System Architecture and Workflow

CFR builds on the Contriever dual-encoder framework (BERT-base, uncased, 110M parameters). Input processing concatenates each claim $c_i$ with one of its annotated subquestions $q_{ij}$, joined by a separator token, producing the query representation $h_y = \mathrm{Encoder}([c_i; q_{ij}])$. Document candidates $d$ (comprising a 200-token window plus title) are encoded as $h_d = \mathrm{Encoder}(d)$. Scoring utilizes cosine similarity:

$$f(h_y, h_d) = \frac{h_y^\top h_d}{\lVert h_y \rVert\,\lVert h_d \rVert}$$
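
For concreteness, the sketch below shows this scoring step, assuming the public `facebook/contriever` checkpoint from Hugging Face and its usual mean-pooling convention; the `encode` helper and the placeholder strings are illustrative, not the authors' code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def encode(texts, max_length=256):
    """Mean-pool token embeddings into one vector per input (Contriever convention)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state         # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)             # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, 768)

# Query: claim and subquestion joined by the separator token.
h_y = encode(["<claim c_i> [SEP] <subquestion q_ij>"])
# Document: title plus a 200-token window.
h_d = encode(["<title> <200-token window>"])

score = torch.nn.functional.cosine_similarity(h_y, h_d)     # f(h_y, h_d)
```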

The pipeline operates in two stages:

  1. Initial retrieval of the top-$K$ documents from the web (via Bing) using BM25 over $[c; q]$.
  2. CFR reranks the $K$ candidates via the fine-tuned Contriever encoder.

The final top-$M$ ($M \approx 1$–$10$) documents are supplied to a reader module (GPT-4 or a similar LM) for evidence extraction and the claim-veracity decision. This multi-step pipeline addresses both direct and indirect evidence relationships critical for complex claims.
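
A schematic of this retrieve-then-rerank flow is sketched below, reusing the `encode` helper from the sketch above; the `rank_bm25` package stands in for BM25 over Bing results, the reader call is elided, and the function name and defaults are illustrative.

```python
import torch
from rank_bm25 import BM25Okapi

def retrieve_and_rerank(claim, subquestion, candidate_docs, K=100, M=10):
    """Stage 1: BM25 over [c; q]. Stage 2: CFR reranks the top-K by cosine similarity."""
    query = f"{claim} {subquestion}"
    bm25 = BM25Okapi([doc.split() for doc in candidate_docs])
    bm25_scores = bm25.get_scores(query.split())
    top_k = [candidate_docs[i] for i in bm25_scores.argsort()[::-1][:K]]

    h_y = encode([f"{claim} [SEP] {subquestion}"])           # encode() from the sketch above
    h_d = encode(top_k)
    sims = torch.nn.functional.cosine_similarity(h_y, h_d)   # (K,)
    order = sims.argsort(descending=True)[:M].tolist()
    return [top_k[i] for i in order]                         # passed to the reader (e.g., GPT-4)
```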

2. Contrastive Training Objectives

CFR applies the InfoNCE contrastive loss to optimize its evidence selection. For a given query $y$, each training tuple includes the query embedding $h_y$, positives $D^+ = \{d^+_1, \dots, d^+_p\}$, and negatives $D^- = \{d^-_1, \dots, d^-_n\}$:

$$\mathcal{L}_{\text{InfoNCE}}(y; D^+, D^-) = -\sum_{d^+ \in D^+} \log \frac{\exp\!\big(f(h_y, h_{d^+})/\tau\big)}{\sum_{d' \in D^+ \cup D^-} \exp\!\big(f(h_y, h_{d'})/\tau\big)}$$

where $\tau$ is set to 0.05. Each positive document contributes its own log term to the outer sum, whereas negatives enter only the shared denominator. In-batch negatives are leveraged to further improve training efficiency and the discrimination of non-relevant evidence.
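
The loss for a single training tuple follows directly from the formula above; the PyTorch sketch below assumes precomputed query, positive, and negative embeddings, with in-batch negatives simply appended to `neg_embs`.

```python
import torch
import torch.nn.functional as F

def infonce_loss(h_y, pos_embs, neg_embs, tau=0.05):
    """InfoNCE with multiple positives: one -log softmax term per positive,
    normalized over all candidates in D+ ∪ D- (in-batch negatives go into neg_embs)."""
    cands = torch.cat([pos_embs, neg_embs], dim=0)                  # (p + n, dim)
    sims = F.cosine_similarity(h_y.unsqueeze(0), cands, dim=-1)     # f(h_y, h_d') for every d'
    log_probs = F.log_softmax(sims / tau, dim=0)
    return -log_probs[: pos_embs.size(0)].sum()                     # sum over d+ in D+
```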

3. Supervision Signal Design

Three complementary sources of supervision construct positive and negative training sets:

  • Gold-label supervision (AVeriTeC evidence): Positives comprise paragraph windows extracted from human-annotated gold evidence articles. Negatives are BM25 hits not contained in the annotated gold evidence.
  • GPT-4 Distillation (“relevance”): BM25 candidates are classified by zero-shot GPT-4 prompts ("Does this passage contain relevant evidence to answer subquestion $q$? Yes/No"). Responses populate $D_d^+$ (for “Yes”) and $D_d^-$ (for “No”).
  • Subquestion answer evaluation (LERC-based): For the top-15 BM25 candidates, GPT-4 generates an answer $a_{ij}$; both $a_{ij}$ and the gold answer $a^g_{ij}$ are shortened via $s(\cdot)$, and the learned metric $\mathrm{LERC}(a_{\text{short}}, a^g_{\text{short}}) \in [0,1]$ is computed. Scores $> 0.7$ mark high-quality positives ($D_l^+$); scores $< 0.3$ mark strong negatives ($D_l^-$).

The final training pool is:

$$D^+ = D_{dg}^+ \cup D_l^+, \qquad D^- = D_{dg}^-$$

where $D_{dg}$ aggregates the gold-evidence and distillation signals.
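
A construction sketch of these pools is given below; `gpt4_relevant` and `lerc` are hypothetical stand-ins for the zero-shot GPT-4 relevance prompt and the answer-generate-then-LERC step described above, and the 0.7/0.3 thresholds follow the paper.

```python
def build_training_pools(bm25_candidates, gold_windows, subquestion, gold_answer,
                         gpt4_relevant, lerc):
    """Merge gold, distilled, and answer-based signals into D+ = D_dg+ ∪ D_l+ and D- = D_dg-."""
    d_pos = list(gold_windows)                          # gold-label positives
    d_neg = []                                          # BM25 hits outside gold evidence
    hard_neg = []                                       # low-LERC candidates (D_l-), used as hard negatives
    for doc in bm25_candidates:
        if doc in gold_windows:
            continue
        if gpt4_relevant(doc, subquestion):             # distilled "Yes" -> positive
            d_pos.append(doc)
        else:                                           # distilled "No" -> negative
            d_neg.append(doc)
    for doc in bm25_candidates[:15]:                    # answer-based signal on top-15 BM25 hits
        score = lerc(doc, subquestion, gold_answer)     # LERC between generated and gold short answers
        if score > 0.7 and doc not in d_pos:
            d_pos.append(doc)
        elif score < 0.3:
            hard_neg.append(doc)
    return d_pos, d_neg, hard_neg
```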

4. Data Construction, Input Representation, and Sampling

Claims and subquestions are concatenated up to 128 tokens. Documents are chunked into 200-token spans with a stride of 100 tokens, each prefixed by the document title. For each claim-subquestion pair, the candidate pool comprises approximately 500 documents sourced from BM25 (web + gold) plus synthetic negatives. Training tuples are generated for all $p \times n$ pairs of positives and negatives. Shuffling and batching (e.g., batch size 32) introduce in-batch negatives. Hard negatives are drawn from BM25 candidates labeled "No" by GPT-4, candidates with low LERC scores, and synthetic negatives for one-hop reasoning.
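
A sketch of the document windowing step, assuming a pre-tokenized document; the 200-token window, 100-token stride, and title prefix follow the description above, while the function name is illustrative.

```python
def chunk_document(title, tokens, window=200, stride=100):
    """Split a document into overlapping 200-token spans (stride 100), each prefixed by the title."""
    chunks, start = [], 0
    while start < len(tokens):
        span = tokens[start:start + window]
        chunks.append(f"{title} {' '.join(span)}")
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

# Training tuples are then formed from every (positive, negative) combination, i.e. p x n
# tuples per query; shuffling and batching (batch size 32) supply additional in-batch negatives.
```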

5. Training Procedure and Hyperparameters

CFR uses AdamW with a learning rate of $2 \times 10^{-5}$ (grid-searched), a batch size of 32, and 12 epochs (≈3 GPU-hours on dual RTX 8000). Loss computation relies on the InfoNCE framework with the described multi-source supervision and in-batch negatives. Training interleaves tuples from the three supervision signals.
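
A training-loop skeleton reflecting these settings is shown below; `tokenizer`, `encoder`, and `infonce_loss` are the objects from the earlier sketches, and `train_loader` is an assumed iterator over (query, positives, negatives) text tuples.

```python
import torch

def encode_with_grad(texts):
    """Mean-pooled Contriever embeddings with gradients enabled (cf. encode() in Section 1)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)   # grid-searched learning rate

encoder.train()
for epoch in range(12):                                        # ~3 GPU-hours on dual RTX 8000
    for query_text, pos_texts, neg_texts in train_loader:      # tuples interleaved across signals
        h_y = encode_with_grad([query_text])[0]
        loss = infonce_loss(h_y, encode_with_grad(pos_texts),
                            encode_with_grad(neg_texts), tau=0.05)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```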

6. Evaluation Strategies and Metrics

Evaluation employs five datasets (n=200 held-out examples each): AVeriTeC, ClaimDecomp, FEVER, HotpotQA, and a synthetic reasoning dataset. Retrieval performance is measured using:

  • LERC score: Average LERC between gold and predicted answer from top-1 document.
  • Top-doc relevance: Fraction of top-1 documents GPT-4 judges as “Yes” relevant.
  • Gold@10: Proportion of instances with gold evidence in the top-10 retrieved.

End-to-end effectiveness is assessed by veracity accuracy: fraction of claims correctly labeled (Support/Refute/NEI) by the reader/verifier.
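
Of these, Gold@10 and the mean reciprocal rank used in the synthetic experiments (Section 7) reduce to simple rank statistics; a sketch assuming ranked candidate lists and gold-document sets:

```python
def gold_at_k(ranked_docs, gold_docs, k=10):
    """1.0 if any gold-evidence document appears in the top-k, else 0.0 (average -> Gold@10)."""
    return float(any(doc in gold_docs for doc in ranked_docs[:k]))

def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Mean reciprocal rank of the first gold document per query (0 when absent)."""
    rr = [
        next((1.0 / (i + 1) for i, doc in enumerate(ranked) if doc in gold), 0.0)
        for ranked, gold in zip(ranked_lists, gold_sets)
    ]
    return sum(rr) / len(rr)
```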

7. Empirical Results and Ablation Analyses

In-domain (AVeriTeC): CFR yields consistent improvements over baseline Contriever:

  • LERC: 0.53 (+5 points)
  • Top-doc relevance: 0.62 (+8 points)
  • Gold@10: 0.59 (+9 points)
  • Veracity Accuracy: 0.60 (+6 points)

Transfer and Synthetic Experiments:

  • ClaimDecomp: Top-doc relevance from 0.32 → 0.37, veracity 0.32 → 0.34.
  • FEVER: Top-doc relevance 0.49 → 0.57; veracity 0.58 → 0.63.
  • HotpotQA: LERC 0.33 → 0.36; top-doc relevance 0.27 → 0.32.
  • Synthetic one-hop reasoning: Mean reciprocal rank (MRR) 0.68 (Contriever) → 0.79 (CFR).

Ablation studies reveal the interplay of signals:

  • “Distill” only: Highest top-doc relevance (0.63), moderate veracity (0.55).
  • “LERC” only: Highest veracity (0.60), lower top-doc relevance (0.56).
  • Combination (“distill+LERC”): Best overall upstream and downstream accuracy.

Qualitative analysis shows improved evidence selection for nuanced subquestions; for example, CFR retrieves contextually relevant passages (e.g., an Amtrak CEO comment on funding) and passages that directly address subquestions about vaccine development details.

8. Extensions and Observations

The CFR multi-signal contrastive framework is compatible with arbitrary black-box teachers (LLMs) or answer-equivalence metrics, and LERC-derived signals enhance retrieval for non-factoid, long-answer subquestions. Synthetic negatives further improve the model’s discrimination for one-hop or indirect evidence. Domain-specific decomposition of subquestions or additional expert supervision may increase reranking quality. A plausible implication is that future systems can extend CFR’s paradigm to new domains with flexible incorporation of external annotation and reasoning strategies for robust fact-checking retrieval (Sriram et al., 7 Oct 2024).
