TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval (2504.16627v1)

Published 23 Apr 2025 in cs.CL

Abstract: We address the challenge of retrieving previously fact-checked claims in monolingual and crosslingual settings - a critical task given the global prevalence of disinformation. Our approach follows a two-stage strategy: a reliable baseline retrieval system using a fine-tuned embedding model and an LLM-based reranker. Our key contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. Additionally, we focus on ensuring that the bulk of the pipeline can be replicated on a consumer GPU. Our final integrated system achieved a success@10 score of 0.938 and 0.81025 on the monolingual and crosslingual test sets, respectively.

Summary

Multilingual Fact-Checked Claim Retrieval in Information Retrieval Tasks

The paper, "TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval," addresses a significant challenge in information retrieval (IR): retrieving previously fact-checked claims across diverse linguistic contexts. The problem is pressing because misinformation spreads globally through social media platforms; the World Economic Forum, for example, identifies misinformation as a critical global threat. The work presented in the paper was conducted for SemEval-2025 Shared Task 7, which focuses on multilingual systems that identify previously fact-checked claims relevant to social media posts across multiple languages.

Overview of Methodology

The authors employ a two-stage strategy for multilingual IR. First, a retrieval system based on a fine-tuned embedding model selects candidate documents. Second, an LLM-based reranker improves the ranking of the retrieved documents. The most innovative aspect of the method is the use of LLMs to translate multilingual social media posts into English, which sidesteps the complexities of multilingual retrieval by working in a single language.
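The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the embeddings are toy NumPy vectors rather than outputs of a fine-tuned model, and `token_overlap` is a hypothetical stand-in for the LLM-based reranker. All function names here are assumptions made for the example.

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, k):
    """Stage 1: rank documents by cosine similarity and keep the k best."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores, kind="stable")[:k]
    return [int(i) for i in top]

def rerank(query, docs, candidate_ids, score_fn, k):
    """Stage 2: reorder the candidates with a more expensive relevance scorer."""
    scored = [(score_fn(query, docs[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps stage-1 order on ties
    return [i for _, i in scored[:k]]

def token_overlap(query, doc):
    """Hypothetical scorer standing in for an LLM reranker (assumption)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

# Toy data: three fact-checked claims with made-up 2-d embeddings.
docs = ["vaccines cause autism claim", "moon landing faked", "5g towers spread virus"]
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_text, query_vec = "5g virus", np.array([1.0, 0.1])

candidates = retrieve_top_k(query_vec, doc_vecs, k=2)
final = rerank(query_text, docs, candidates, token_overlap, k=2)
```

The design point is that the cheap first stage narrows the corpus to a short candidate list, so the expensive second-stage scorer only has to run on a handful of documents per post.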

Data Augmentation and Translation: The strategy involves translating all social media posts into English using the Aya Expanse model, which is reported to outperform conventional systems like Google Translate. This choice is crucial given the multilingual nature of the dataset, whose social media posts span 27 languages. Translating posts into English lets the system leverage well-developed English models, allowing for more accurate IR without language-specific models. The translation-based approach carries a risk of semantic loss, although the paper argues that translation helps in cross-lingual pattern identification, especially for languages with limited resources.

Re-ranking with LLMs: The paper explores various re-ranking techniques, including cross-encoders and ColBERTv2, ultimately settling on the Qwen 2.5 72B Instruct model owing to its superior performance. The model's semantic understanding and broad world knowledge help in judging the relevance of the retrieved information.

Implementation and Results: The final system chains the retrieval and reranking steps. It achieves success@10 scores of 0.938 on the monolingual test set and 0.81025 on the crosslingual test set. These results support the use of translation and LLMs in multilingual IR tasks, providing robust retrieval performance while keeping computational resource requirements manageable.
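For readers unfamiliar with the metric, success@k is commonly defined as the fraction of queries for which at least one relevant document appears in the top k results. The sketch below assumes that definition; the function names and toy data are hypothetical and are not taken from the paper or the shared task's scoring code.

```python
def success_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1.0 if any relevant id appears in the top-k ranking, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def mean_success_at_k(rankings, relevant, k=10):
    """Average success@k over all queries (here, social media posts)."""
    hits = [success_at_k(rankings[q], relevant[q], k) for q in rankings]
    return sum(hits) / len(hits)

# Toy example: two posts, each with a ranked list of fact-check ids.
rankings = {"post_1": [3, 1, 2], "post_2": [5, 6, 7]}
relevant = {"post_1": {2}, "post_2": {9}}

score = mean_success_at_k(rankings, relevant, k=3)
```

In this toy data, only `post_1` has a relevant fact-check within its top 3, so one of the two posts counts as a success.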

Implications and Future Directions

The paper has implications for both practical applications and theoretical advances in AI and IR. Practically, the system offers a scalable response to a pressing issue: global misinformation. It addresses the challenges faced by fact-checkers who need to navigate and verify claims across linguistic boundaries. Theoretically, the approach highlights the potential of LLM-based translation for overcoming linguistic barriers in information retrieval.

Despite the promising results, the paper acknowledges limitations, primarily the constraints of available computational resources, which limited the assessment of larger models. Future research could explore how well such systems generalize across datasets, extend the techniques to larger and more diverse corpora, or optimize the models further through quantized embedding vectors for faster and more resource-efficient retrieval.
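To make the quantization idea concrete, here is a minimal sketch of symmetric int8 scalar quantization applied to embedding vectors. It is one simple scheme among many and is an assumption for illustration; the paper does not specify a quantization method, and the function names are hypothetical.

```python
import numpy as np

def quantize_int8(vecs):
    """Map float vectors to int8 codes using one shared scale (symmetric scheme)."""
    # Guard against an all-zero input so the scale is never zero.
    scale = max(float(np.abs(vecs).max()), 1e-12) / 127.0
    codes = np.round(vecs / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float vectors from int8 codes."""
    return codes.astype(np.float32) * scale

# Toy embeddings: float32 uses 4 bytes per value, int8 uses 1.
vecs = np.array([[0.5, -1.0], [0.25, 0.0]], dtype=np.float32)
codes, scale = quantize_int8(vecs)
approx = dequantize(codes, scale)
```

The stored codes take a quarter of the memory of float32 vectors, at the cost of a rounding error of at most half the scale per component.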

This paper bridges the gap between computational linguistics and misinformation control through a novel application of LLMs for translation and retrieval tasks, providing a framework with potential adaptability across various IR domains. Its approach may inspire subsequent research endeavors aiming for improved retrieval mechanisms that minimize semantic loss while efficiently addressing multilingual IR challenges.