
CheckThat! Task 2: Verified Claim Retrieval

Updated 15 September 2025
  • CheckThat! Task 2 is a fact-checking task that retrieves and ranks verified claims to validate new social media assertions.
  • It employs methods such as embedding-based similarity, neural classification, and hybrid IR approaches to identify semantic equivalence.
  • Evaluated using MAP@5 and related metrics, the task enhances real-time misinformation mitigation by reducing redundant fact-checking.

CheckThat! Task 2 is a central component in the fact-checking evaluation pipeline, focused on “Verified Claim Retrieval”—the automatic identification of prior fact-checked claims that can be used to verify new, check-worthy assertions in social media posts. The task has evolved as part of the CLEF CheckThat! Lab, integrating advances in natural language processing to address the high variability, brevity, and informal style inherent to social media content. The retrieval of prior fact-checked claims expedites the validation process, reduces redundancy in fact-checkers’ efforts, and supports real-time mitigation of misinformation.

1. Task Definition and Operational Framework

Task 2 is formally defined as a retrieval and ranking problem. Given:

  • An input claim $c$ (typically extracted from a tweet or a candidate check-worthy post).
  • A set of previously fact-checked claims $V_c = \{v_1, v_2, \ldots, v_n\}$.

For each pair $(c, v_i)$, the system must determine whether $v_i$ is “Relevant” (i.e., it covers the same claim, or a sub-claim of it, as $c$ and could obviate the need to fact-check $c$ anew) or “Irrelevant.” The system ranks all $v_i$ in $V_c$ for each input claim $c$ so that all Relevant claims are ideally placed at the top.
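
The input/output contract can be sketched as a simple scoring-and-sorting routine; the claims and the word-overlap scorer below are invented placeholders for illustration, not part of the official task data or baseline.

```python
# Minimal sketch of the Task 2 interface: score every (c, v_i) pair, sort descending.
from typing import Callable, List, Tuple

def rank_verified_claims(
    input_claim: str,
    verified_claims: List[str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Rank verified claims so that Relevant ones ideally end up at the top."""
    scored = [(v, score(input_claim, v)) for v in verified_claims]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def word_overlap(c: str, v: str) -> float:
    # Trivial Jaccard word overlap, a placeholder for the scorers of Section 2.
    a, b = set(c.lower().split()), set(v.lower().split())
    return len(a & b) / max(len(a | b), 1)

# Toy, invented claims purely to show the shape of the input and output.
ranking = rank_verified_claims(
    "5G towers spread the virus",
    ["5G networks do not transmit viruses",
     "The Eiffel Tower was completed in 1889"],
    word_overlap,
)
```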

This problem is closely aligned with paraphrase detection, textual similarity, and textual entailment: systems must detect semantic and pragmatic equivalence rather than surface-level matching.

2. Methodological Paradigms

Task 2 enables a variety of algorithmic and modeling strategies:

A. Embedding-Based Similarity

  • Both $c$ and $v_i$ are embedded (e.g., via static word vectors or contextualized transformers such as BERT) into a continuous vector space.
  • Similarity between $c$ and $v_i$ is computed, most commonly using cosine similarity:

$$S(c, v_i) = \frac{E(c) \cdot E(v_i)}{\|E(c)\|\,\|E(v_i)\|}$$

where $E(\cdot)$ denotes the embedding function.

  • Verified claims are ordered by $S(c, v_i)$, producing a ranked list; a minimal sketch follows.
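
A minimal bi-encoder sketch of this pipeline, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (an illustrative choice; the shared task does not prescribe a model):

```python
# Bi-encoder ranking sketch: embed c and all v_i, rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative checkpoint

def rank_by_cosine(input_claim, verified_claims):
    # E(c) and E(v_i): L2-normalised embeddings, so a dot product equals cosine similarity.
    vecs = model.encode([input_claim] + verified_claims, normalize_embeddings=True)
    c_vec, v_vecs = vecs[0], vecs[1:]
    scores = v_vecs @ c_vec                       # S(c, v_i) for every candidate
    order = np.argsort(-scores)                   # highest similarity first
    return [(verified_claims[i], float(scores[i])) for i in order]
```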

B. Neural Classification and Textual Entailment

  • The task can be reframed as a classification or entailment problem: for each pair $(c, v_i)$, a neural network (such as BERT or another fine-tuned transformer) predicts whether $v_i$ “verifies” $c$.
  • Training requires labeled supervision for $(c, v_i)$ pairs.
  • Ranking can be derived directly from the network’s output scores, with margin- or ranking-based loss functions (e.g., contrastive or triplet loss) used to sharpen the separation of Relevant from Irrelevant pairs; a cross-encoder sketch follows this list.
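
As a rough illustration of pairwise neural scoring, the sketch below uses sentence-transformers’ CrossEncoder with a public MS MARCO re-ranking checkpoint as a stand-in; in practice the model would be fine-tuned on labelled $(c, v_i)$ pairs from the task rather than used off the shelf.

```python
# Cross-encoder sketch: the network jointly encodes each (c, v_i) pair and
# outputs a relevance score that directly induces the ranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative checkpoint

def rank_by_classifier(input_claim, verified_claims):
    scores = reranker.predict([(input_claim, v) for v in verified_claims])
    return sorted(zip(verified_claims, scores), key=lambda p: p[1], reverse=True)
```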

C. Classical Feature Engineering

  • Some systems engineer features such as:
    • Bag-of-words or n-gram overlaps.
    • Named entity matches.
    • Lexical and semantic similarity metrics.
  • These features are provided to classifiers (e.g., SVMs), which output ranking scores for the $(c, v_i)$ pairs; an illustrative sketch follows this list.
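
A hedged sketch of such a feature-based ranker, assuming scikit-learn; the feature set (word overlap, a crude capitalisation-based entity proxy, a length ratio) is illustrative rather than any particular team’s recipe.

```python
# Hand-engineered features for one (c, v_i) pair, fed to an SVM whose decision
# function serves as the ranking score.
from sklearn.svm import SVC

def pair_features(c: str, v: str) -> list:
    cw, vw = set(c.lower().split()), set(v.lower().split())
    jaccard = len(cw & vw) / max(len(cw | vw), 1)                  # bag-of-words overlap
    caps_c = {t for t in c.split() if t[:1].isupper()}             # crude named-entity proxy
    caps_v = {t for t in v.split() if t[:1].isupper()}
    entity_overlap = len(caps_c & caps_v) / max(len(caps_c | caps_v), 1)
    length_ratio = min(len(cw), len(vw)) / max(len(cw), len(vw), 1)
    return [jaccard, entity_overlap, length_ratio]

svm = SVC(kernel="rbf")
# Given labelled training pairs (X_train, y_train):
#   svm.fit(X_train, y_train)
#   scores = svm.decision_function([pair_features(c, v) for v in verified_claims])
```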

D. Hybrid and Cascade Architectures

  • Unsupervised IR techniques (TF–IDF, BM25, ElasticSearch) often provide an initial candidate shortlist, which is subsequently re-ranked with more computationally intensive neural models for final ordering.
  • Ensemble and cascade approaches are prevalent because they combine the efficiency of lexical retrieval with the accuracy of semantic re-ranking; a two-stage sketch follows this list.
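
A two-stage sketch under stated assumptions: the rank_bm25 package supplies the lexical shortlist and the cross-encoder from 2B re-ranks it (both are illustrative choices, not the labs’ official baselines).

```python
# Hybrid cascade: cheap BM25 shortlist first, neural re-ranking of the survivors second.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative checkpoint

def hybrid_rank(input_claim, verified_claims, shortlist_size=100):
    # Stage 1: lexical retrieval over the full candidate set.
    bm25 = BM25Okapi([v.lower().split() for v in verified_claims])
    lexical = bm25.get_scores(input_claim.lower().split())
    shortlist = sorted(range(len(verified_claims)),
                       key=lambda i: lexical[i], reverse=True)[:shortlist_size]

    # Stage 2: computationally heavier semantic re-ranking of the shortlist only.
    pairs = [(input_claim, verified_claims[i]) for i in shortlist]
    neural = reranker.predict(pairs)
    order = sorted(zip(shortlist, neural), key=lambda p: p[1], reverse=True)
    return [(verified_claims[i], float(s)) for i, s in order]
```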

3. Evaluation Metrics

As a ranking task, Task 2 is evaluated with metrics sensitive to the order of retrieved claims:

Metric            Description
MAP@5             Mean Average Precision at cutoff 5; the primary metric.
MRR               Mean Reciprocal Rank; measures how early the first relevant item is retrieved.
MAP@k, Recall@k   MAP and recall at additional cutoffs k = 3, 10, 20.

The general formula for MAP is:

$$\operatorname{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{m_q} \sum_{k=1}^{N_q} P_q(k)\,\mathrm{rel}_q(k)$$

where $Q$ is the set of queries, $m_q$ is the number of relevant items for query $q$, $N_q$ is the number of retrieved items for query $q$ (or the evaluation cutoff), $P_q(k)$ is the precision at rank $k$ for query $q$, and $\mathrm{rel}_q(k)$ is 1 if the item at rank $k$ is relevant and 0 otherwise.
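
For concreteness, a direct implementation of the formula above at a cutoff $k$ is sketched below; note that some evaluation scripts cap the denominator at $\min(m_q, k)$ when a cutoff is applied, which is flagged in a comment.

```python
# Average precision at cutoff k, and its mean over the query set Q.
def average_precision_at_k(ranked_ids, relevant_ids, k=5):
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank          # P_q(rank) * rel_q(rank)
    return precision_sum / len(relevant)          # some scorers divide by min(m_q, k)

def map_at_k(rankings, relevants, k=5):
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)
```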

4. Representative System Architectures and Findings

The 2020 and 2021 CheckThat! labs featured a spectrum of system designs:

  • BM25/ElasticSearch as strong retrieval baselines that are surprisingly competitive, especially when used for initial pruning (Barron-Cedeno et al., 2020).
  • Transformer-based semantic similarity: Fine-tuned BERT, RoBERTa, and, notably, Sentence-BERT models, achieved top MAP@5 performance by learning shared semantic spaces for input–claim and verified–claim pairs. For example, the Buster.AI and UNIPI-NLE teams demonstrated that cascade fine-tuning (a regression step for cosine similarity followed by binary classification) and use of triplet losses were particularly effective (Barron-Cedeno et al., 2020).
  • Ensemble learning-to-rank frameworks (e.g., LambdaMART) combining deep representations with IR-derived features improved over either approach in isolation, with absolute MAP@5 gains (on the order of 13.4 points for English) relative to strong baselines (Nakov et al., 2021).
  • Handling large candidate sets efficiently: Retrieval over 10,000 candidate claims was facilitated by two-stage (IR-first, neural-re-rank-second) architectures, and KD-tree and approximate nearest-neighbor search were used to scale dense vector retrieval (Cheema et al., 2020); see the sketch after this list.
  • Performance Benchmarks: Top systems reached MAP@5 in the range [0.80, 0.93], significantly surpassing the basic ElasticSearch baseline (Barron-Cedeno et al., 2020).
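
Illustrating the approximate nearest-neighbor point above, a sketch assuming FAISS and the bi-encoder from 2A; with L2-normalised embeddings, an HNSW index under the default L2 metric yields the same ordering as cosine similarity.

```python
# Dense-retrieval scaling sketch: index all verified-claim embeddings once,
# then answer each input claim with an approximate top-k search.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")            # illustrative checkpoint

def build_index(verified_claims):
    vecs = model.encode(verified_claims, normalize_embeddings=True).astype("float32")
    index = faiss.IndexHNSWFlat(vecs.shape[1], 32)          # approximate HNSW graph index
    index.add(vecs)
    return index

def shortlist(index, verified_claims, input_claim, k=50):
    q = model.encode([input_claim], normalize_embeddings=True).astype("float32")
    _, ids = index.search(q, k)                             # approximate top-k neighbours
    return [verified_claims[i] for i in ids[0] if i != -1]
```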

5. Key Challenges and Practical Considerations

  • Lexical/Syntactic Divergence: The same fact may be phrased very differently across claims, rendering surface-level overlap inadequate. The transition to contextualized semantic embeddings (transformers) was essential in overcoming this issue.
  • Candidate Explosion: Extremely large candidate sets necessitate efficient initial pruning, without which even state-of-the-art neural models cannot scale.
  • Class Imbalance: Relevant verified claims are sparse. Sampling strategies and tailored loss functions such as margin-based or triplet loss were important to mitigate the skew (a minimal sketch follows this list).
  • Practical Trade-Offs: Neural semantic scoring provides the best accuracy but is computationally more demanding; hybrid approaches balance retrieval speed and ranking quality (Barron-Cedeno et al., 2020, Nakov et al., 2021).
  • Cross-lingual Generalization and Cultural Bias: Recent efforts explored multilingual models and auxiliary learning tasks (e.g., language identification) to mitigate bias and improve robustness across cultural contexts (Schlicht et al., 2021).
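
To make the margin-based option above concrete, a minimal PyTorch sketch; `encoder` stands for any module that maps a batch of claim strings to embeddings and is an assumption of this sketch, not part of the task definition.

```python
# Triplet objective: pull each input claim toward its Relevant verified claim
# and push it away from a sampled Irrelevant one by at least the margin.
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

def training_step(encoder, input_claims, relevant_claims, irrelevant_claims):
    # encoder: hypothetical module mapping a list of claim strings to a (batch, dim) tensor.
    anchor = encoder(input_claims)           # input claims c
    positive = encoder(relevant_claims)      # their Relevant verified claims
    negative = encoder(irrelevant_claims)    # sampled Irrelevant candidates (e.g., hard negatives)
    return triplet_loss(anchor, positive, negative)
```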

6. Impact, Evaluation Practice, and Future Directions

Task 2 fundamentally advances automated fact-checking pipelines by minimizing redundant effort, enabling “just-in-time” claim validation, and focusing fact-checker attention on novel, previously unverified assertions. MAP@5 and related metrics provide feedback not only on retrieval accuracy but also on ranking quality.

Rankings obtained by participating systems show that accurate, high-recall retrieval of verified claims is currently feasible, especially in high-resource languages and data settings where relevant fact-checks are likely to exist. Key improvements are expected in:

  • Further integration of multilingual, cross-lingual, and culture-aware models (Schlicht et al., 2021).
  • Data augmentation and enrichment to bridge paraphrasing and topic drift.
  • More interpretable and explainable ranking outputs for practical deployment in fact-checking organizations.
  • Scaling and adapting to low-resource settings and languages.

Task 2 remains a critical research frontier for operationalizing NLP technologies in real-world misinformation environments driven by social media dynamics.
