Contrastive Fact-Checking Reranker (CFR)
- The paper introduces CFR, which improves evidence retrieval for complex claims through contrastive learning and multi-source supervision.
- CFR re-ranks BM25 candidates using a fine-tuned dual-encoder architecture with InfoNCE loss and in-batch negatives for robust selection.
- Empirical results show CFR boosts retrieval metrics and veracity accuracy across diverse datasets in both in-domain and transfer settings.
The Contrastive Fact-Checking Reranker (CFR) is a retrieval architecture for fact-checking pipelines that enhances evidence selection for claims requiring nuanced, multi-hop inference. CFR is trained on the AVeriTeC dataset using contrastive learning with supervised, distilled, and answer-based signals, improving retrieval quality and downstream veracity judgments in real-world and transfer settings (Sriram et al., 7 Oct 2024).
1. System Architecture and Workflow
CFR builds on the Contriever dual-encoder framework (BERT-base, uncased, 110M parameters). Input processing concatenates each claim $c$ with one of its annotated subquestions $q$, inserting a separator token, to produce the query representation $E(c \,[\mathrm{SEP}]\, q)$. Each document candidate $d$ (comprising a 200-token window plus the article title) is encoded as $E(d)$. Scoring utilizes cosine similarity:

$$s(c, q, d) = \cos\big(E(c \,[\mathrm{SEP}]\, q),\ E(d)\big)$$
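A minimal sketch of this scoring step, assuming a Hugging Face Contriever checkpoint (`facebook/contriever`) with mean-pooled token embeddings; the pooling and `[SEP]` conventions shown here are assumptions, not details confirmed by the paper:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Assumed base checkpoint; the paper fine-tunes a Contriever (BERT-base) dual encoder.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    """Encode texts and mean-pool token embeddings over non-padding positions."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                      return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)            # (B, H)

# Query = claim [SEP] subquestion; document = title + 200-token window.
claim = "The senator voted against the infrastructure bill."
subq = "How did the senator vote on the infrastructure bill?"
query = f"{claim} [SEP] {subq}"
doc = "Senate roll call [SEP] The senator was recorded as voting no on ..."

q, d = embed([query]), embed([doc])
score = F.cosine_similarity(q, d)                          # s(c, q, d)
```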
The pipeline operates in two stages:
- Initial retrieval of the top-$k$ documents using BM25 over the web results returned by Bing.
- CFR reranks the candidates via the fine-tuned Contriever encoder.
The final top-$k$ documents (up to $10$) are supplied to a reader module (GPT-4 or a similar LM) for evidence extraction and the claim veracity decision. This multi-stage pipeline addresses both the direct and indirect evidence relationships critical for complex claims.
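A compact sketch of the two-stage retrieve-and-rerank flow, reusing the `embed()` helper above; `rank_bm25` serves as a stand-in lexical retriever, and the cutoffs (`k_bm25=100`, `k_final=10`) are illustrative assumptions rather than values from the paper:

```python
import numpy as np
import torch.nn.functional as F
from rank_bm25 import BM25Okapi

def retrieve_and_rerank(query, corpus, k_bm25=100, k_final=10):
    """Stage 1: BM25 candidate retrieval; Stage 2: CFR dense reranking."""
    # Stage 1: lexical retrieval over the web-sourced candidate documents.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    bm25_scores = bm25.get_scores(query.split())
    candidates = [corpus[i] for i in np.argsort(bm25_scores)[::-1][:k_bm25]]

    # Stage 2: rerank the BM25 candidates with the fine-tuned dual encoder.
    q, d = embed([query]), embed(candidates)
    sims = F.cosine_similarity(q, d)                       # cosine reranking scores
    order = sims.argsort(descending=True)[:k_final].tolist()
    return [candidates[i] for i in order]                  # passed to the reader LM
```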
2. Contrastive Training Objectives
CFR applies the InfoNCE contrastive loss to optimize its evidence selection. For a given query $q$, each training tuple includes the query embedding $E(q)$, a set of positives $\mathcal{P}$, and a set of negatives $\mathcal{N}$:

$$\mathcal{L} = -\log \frac{\sum_{d^{+} \in \mathcal{P}} \exp\!\big(s(q, d^{+})/\tau\big)}{\sum_{d^{+} \in \mathcal{P}} \exp\!\big(s(q, d^{+})/\tau\big) + \sum_{d^{-} \in \mathcal{N}} \exp\!\big(s(q, d^{-})/\tau\big)}$$

where the temperature $\tau$ is set to 0.05. Multiple positive documents contribute additive terms to the numerator, whereas all negatives enter only the denominator. In-batch negatives are leveraged to further improve training efficiency and the discrimination of non-relevant evidence.
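A minimal PyTorch sketch of this multi-positive InfoNCE loss, matching the formula above and assuming already-encoded embeddings as inputs:

```python
import torch
import torch.nn.functional as F

def info_nce_multi_positive(q, positives, negatives, tau=0.05):
    """Multi-positive InfoNCE: positives sum in the numerator, while all
    negatives (explicit and in-batch) join only the denominator.
    Shapes: q: (H,), positives: (P, H), negatives: (N, H)."""
    pos = torch.exp(F.cosine_similarity(q.unsqueeze(0), positives) / tau)  # (P,)
    neg = torch.exp(F.cosine_similarity(q.unsqueeze(0), negatives) / tau)  # (N,)
    return -torch.log(pos.sum() / (pos.sum() + neg.sum()))
```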
3. Supervision Signal Design
Three complementary sources of supervision construct positive and negative training sets:
- Gold-label supervision (AVeriTeC evidence): Positives comprise paragraph windows extracted from human-annotated gold evidence articles. Negatives are BM25 hits not contained in the annotated gold evidence.
- GPT-4 Distillation (“relevance”): BM25 candidates are classified by zero-shot GPT-4 prompts ("Does this passage contain relevant evidence to answer subquestion q? Yes/No"). Responses populate the positive pool (for “Yes”) and the negative pool (for “No”); a labeling sketch appears below.
- Subquestion answer evaluation (LERC-based): For the top-15 BM25 candidates, GPT-4 generates a candidate answer to the subquestion, which is scored against the gold answer using the learned LERC metric; high LERC scores mark high-quality positives, while low LERC scores mark strong negatives.
The final training pools are the unions of these signals:

$$\mathcal{P} = \mathcal{P}_{\mathrm{sup}} \cup \mathcal{P}_{\mathrm{LERC}}, \qquad \mathcal{N} = \mathcal{N}_{\mathrm{sup}} \cup \mathcal{N}_{\mathrm{LERC}}$$

where $\mathcal{P}_{\mathrm{sup}}$ and $\mathcal{N}_{\mathrm{sup}}$ aggregate the gold-evidence and distillation signals.
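A sketch of the GPT-4 distillation signal referenced above; `llm` is a hypothetical callable wrapping a chat-completion API, and the prompt wording follows the zero-shot relevance question quoted earlier:

```python
def distill_relevance_labels(subquestion, candidates, llm):
    """Label each BM25 candidate passage as a distilled positive ('Yes')
    or negative ('No') for the given subquestion."""
    positives, negatives = [], []
    for passage in candidates:
        prompt = (f'Does this passage contain relevant evidence to answer '
                  f'subquestion "{subquestion}"? Yes/No\n\n{passage}')
        label = llm(prompt).strip().lower()                # hypothetical LLM call
        (positives if label.startswith("yes") else negatives).append(passage)
    return positives, negatives
```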
4. Data Construction, Input Representation, and Sampling
Claims and subquestions are concatenated up to 128 tokens. Documents are chunked into 200-token spans with a stride of 100 tokens, each preceded by a title. For each claim-subquestion pair, a candidate pool consists of approximately 500 documents sourced from BM25 (web + gold) and synthetic negatives. Training tuples are generated for all pairs of positives and negatives. Shuffling and batching (e.g., batch size of 32) introduce in-batch negatives. Hard negatives are selected from BM25 candidates labeled "No" by GPT-4, those with low LERC scores, or synthetic negatives for one-hop reasoning.
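A sketch of the document chunking described above (200-token windows, stride 100, title prefixed); the `[SEP]`-style title separator is an assumption:

```python
def chunk_document(title, tokens, window=200, stride=100):
    """Split a tokenized document into overlapping 200-token windows
    (stride 100), each prefixed with the article title."""
    chunks, start = [], 0
    while True:
        span = tokens[start:start + window]
        chunks.append(f"{title} [SEP] {' '.join(span)}")
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```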
5. Training Procedure and Hyperparameters
CFR uses AdamW with a grid-searched learning rate, a batch size of 32, and 12 training epochs (≈3 GPU-hours on dual RTX 8000 GPUs). Loss computation relies on the InfoNCE framework with the described multi-source supervision and in-batch negatives. Training interleaves tuples from the three supervision signals.
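A training-loop skeleton under stated assumptions: `train_loader` is a hypothetical loader yielding one (query, positive, negative) text triple per example, the learning rate below is a placeholder for the grid-searched value, and in-batch negatives are formed from the other examples' positives; it reuses `encoder`, `embed()`, and `info_nce_multi_positive()` from the earlier sketches:

```python
import torch

# Placeholder learning rate; the paper selects the value by grid search.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

for epoch in range(12):                       # 12 epochs, batch size 32
    for batch in train_loader:                # interleaves the three supervision signals
        q = embed(batch["query"])             # claim [SEP] subquestion
        pos = embed(batch["positive"])        # one positive passage per example
        neg = embed(batch["negative"])        # one hard negative per example
        losses = []
        for i in range(q.size(0)):
            # In-batch negatives: other examples' positives act as extra negatives.
            in_batch = torch.cat([neg[i:i + 1], pos[torch.arange(len(pos)) != i]])
            losses.append(info_nce_multi_positive(q[i], pos[i:i + 1], in_batch))
        loss = torch.stack(losses).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```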
6. Evaluation Strategies and Metrics
Evaluation employs five datasets (n=200 held-out examples each): AVeriTeC, ClaimDecomp, FEVER, HotpotQA, and a synthetic reasoning dataset. Retrieval performance is measured using:
- LERC score: Average LERC between the gold answer and the answer predicted from the top-1 retrieved document.
- Top-doc relevance: Fraction of top-1 documents GPT-4 judges as “Yes” relevant.
- Gold@10: Proportion of instances with gold evidence in the top-10 retrieved.
End-to-end effectiveness is assessed by veracity accuracy: fraction of claims correctly labeled (Support/Refute/NEI) by the reader/verifier.
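For reference, a minimal sketch of the two rank-based measures (Gold@10 here, plus the MRR used in the synthetic one-hop experiments reported below); the LERC and GPT-4 relevance judgments require external models and are omitted:

```python
def gold_at_k(ranked_docs, gold_docs, k=10):
    """1.0 if any gold evidence document appears in the top-k, else 0.0;
    averaged over claims to give Gold@10."""
    return float(any(doc in gold_docs for doc in ranked_docs[:k]))

def mean_reciprocal_rank(ranked_lists, gold_sets):
    """MRR over a set of queries (reported for the synthetic one-hop setting)."""
    reciprocal_ranks = []
    for ranked, gold in zip(ranked_lists, gold_sets):
        rank = next((i + 1 for i, doc in enumerate(ranked) if doc in gold), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```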
7. Empirical Results and Ablation Analyses
In-domain (AVeriTeC): CFR yields consistent improvements over baseline Contriever:
- LERC: 0.53 (+5 points)
- Top-doc relevance: 0.62 (+8 points)
- Gold@10: 0.59 (+9 points)
- Veracity Accuracy: 0.60 (+6 points)
Transfer and Synthetic Experiments:
- ClaimDecomp: Top-doc relevance from 0.32 → 0.37, veracity 0.32 → 0.34.
- FEVER: Top-doc relevance 0.49 → 0.57; veracity 0.58 → 0.63.
- HotpotQA: LERC 0.33 → 0.36; top-doc relevance 0.27 → 0.32.
- Synthetic one-hop reasoning: Mean reciprocal rank (MRR) 0.68 (Contriever) → 0.79 (CFR).
Ablation studies reveal the interplay of signals:
- “Distill” only: Highest top-doc relevance (0.63), moderate veracity (0.55).
- “LERC” only: Highest veracity (0.60), lower top-doc relevance (0.56).
- Combination (“distill+LERC”): Best overall upstream and downstream accuracy.
Qualitative analysis shows improved evidence selection for nuanced subquestions: for example, CFR retrieves contextually relevant passages (e.g., an Amtrak CEO's comment on funding) and surfaces evidence that directly addresses subquestions about vaccine development details.
8. Extensions and Observations
The CFR multi-signal contrastive framework is compatible with arbitrary black-box teachers (LLMs) or answer-equivalence metrics, and LERC-derived signals enhance retrieval for non-factoid, long-answer subquestions. Synthetic negatives further improve the model’s discrimination for one-hop or indirect evidence. Domain-specific decomposition of subquestions or additional expert supervision may increase reranking quality. A plausible implication is that future systems can extend CFR’s paradigm to new domains with flexible incorporation of external annotation and reasoning strategies for robust fact-checking retrieval (Sriram et al., 7 Oct 2024).