- The paper presents a multi-lingual benchmark leveraging Wikipedia to evaluate the robustness of dense retrieval models across eleven diverse languages.
- It compares the traditional BM25 method and the dense retriever mDPR, revealing mDPR's limitations in zero-shot cross-lingual scenarios.
- The study highlights that hybrid models combining sparse and dense signals significantly boost retrieval metrics like MRR and Recall@100.
Multi-lingual Benchmark for Dense Retrieval: An Analytical Overview
The paper, authored by Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin, introduces a benchmark dataset designed to facilitate research on mono-lingual retrieval across eleven typologically diverse languages using dense representations. The focus lies in evaluating the robustness and generalizability of dense retrieval models, especially given the increasingly observed limitations of such techniques when applied to out-of-distribution data.
Dense retrieval models, particularly those based on transformer architectures, have demonstrated promise in English-centric settings but face challenges in non-English contexts. This work addresses the gap by presenting a dataset grounded in passage-level retrieval tasks across multiple languages. This is especially pertinent given that these models typically rely on supervised learning to train bi-encoder architectures, which may falter on "out-of-distribution" text inputs, a non-trivial consideration in a multi-lingual world.
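The bi-encoder setup described above can be illustrated with a minimal sketch: a query encoder and a passage encoder map text into a shared vector space, and retrieval ranks passages by similarity to the query vector. The fixed vectors below are toy stand-ins for encoder outputs, not actual mDPR embeddings.

```python
import math

def dot(u, v):
    # Inner-product similarity, the standard scoring function for bi-encoders.
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

# Toy "encoder outputs": in mDPR these come from two transformer towers
# initialized from multi-lingual BERT; here they are hand-picked vectors.
query_vec = normalize([0.9, 0.1, 0.3])
passage_vecs = {
    "p1": normalize([0.8, 0.2, 0.4]),  # topically close to the query
    "p2": normalize([0.1, 0.9, 0.0]),  # unrelated passage
}

# Bi-encoder retrieval: rank passages by similarity to the query vector.
ranked = sorted(passage_vecs,
                key=lambda p: dot(query_vec, passage_vecs[p]),
                reverse=True)
```

Because passage vectors can be precomputed and indexed offline, only the query must be encoded at search time, which is what makes this architecture practical at corpus scale.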
Dataset Construction and Baselines
The core contribution of this paper is the creation of a multi-lingual retrieval benchmark. Utilizing Wikipedia as a corpus, the authors provide a comprehensive framework to assess retrieval capabilities across languages such as Arabic, Bengali, Finnish, and Japanese, among others. The benchmark, referred to simply as "Mr. TyDi", draws from the framework of a prior QA dataset, TyDi QA, to establish passage-level relevance judgments, thereby framing an open-retrieval task.
To validate the utility of the dataset, the authors present two baseline retrieval methods. The first is traditional BM25, known for its robust performance across languages. The second, mDPR, is a dense retrieval adaptation built on multi-lingual BERT, serving as a zero-shot baseline for gauging how models trained on one language transfer to others. Although mDPR trails BM25 in recall and ranking quality, its incorporation into sparse-dense hybrid models shows that the two approaches supply complementary relevance signals, highlighting the nuanced interplay between dense vector representations and traditional term-based methods.
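A common way to realize such a sparse-dense hybrid is a weighted sum of the two systems' scores after normalization. The min-max normalization and the `alpha` weight below are illustrative choices for the sketch, not necessarily the exact fusion scheme used in the paper.

```python
def minmax(scores):
    # Rescale scores to [0, 1] so sparse and dense values are comparable.
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(sparse, dense, alpha=0.5):
    # Weighted sum of normalized sparse (BM25) and dense (mDPR-style) scores.
    s, d = minmax(sparse), minmax(dense)
    docs = set(s) | set(d)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in docs}

# Hypothetical per-document scores for one query.
bm25_scores = {"doc1": 12.0, "doc2": 7.5, "doc3": 3.1}
dense_scores = {"doc1": 0.61, "doc2": 0.78, "doc3": 0.40}
fused = hybrid(bm25_scores, dense_scores, alpha=0.5)
best = max(fused, key=fused.get)
```

In practice `alpha` would be tuned on held-out data; the point of the sketch is that a document ranked moderately by both systems can overtake one ranked highly by only one of them.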
Results and Discussion
The results underscore the complexities of zero-shot dense retrieval: in most linguistic contexts, such models deliver subpar performance relative to the classic BM25 baseline. The empirical evidence suggests that while dense retrieval models like mDPR capture valuable relevance signals, they are inherently less robust in zero-shot cross-lingual scenarios. Nevertheless, when combined with BM25 in a hybrid model, they can enhance retrieval performance, as evidenced by significant gains in Mean Reciprocal Rank (MRR) and Recall@100 across most languages, save for Swahili and Telugu.
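For concreteness, the two reported metrics can be computed as follows. The `rankings` and `relevant` inputs are hypothetical toy data, and Recall@k is scored here as whether any relevant passage appears in the top k, a simplification that matters when relevance judgments are sparse.

```python
def mrr(rankings, relevant):
    # rankings: query -> ranked list of doc ids
    # relevant: query -> set of relevant doc ids
    total = 0.0
    for q, docs in rankings.items():
        for rank, doc in enumerate(docs, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank  # reciprocal rank of first hit
                break
    return total / len(rankings)

def recall_at_k(rankings, relevant, k=100):
    # Fraction of queries with at least one relevant doc in the top k.
    hits = sum(bool(set(docs[:k]) & relevant[q])
               for q, docs in rankings.items())
    return hits / len(rankings)

rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d9"}}
# q1's first relevant doc is at rank 2, q2 finds none:
# mrr = (1/2 + 0) / 2 = 0.25
```

MRR rewards placing a relevant passage near the top of the list, while Recall@100 measures whether the candidate pool handed to a downstream reader contains a relevant passage at all; the hybrid's gains on both suggest it improves early precision and coverage together.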
Implications and Future Direction
The implications of this research for both the theoretical and practical realms of information retrieval are profound. It prompts a reevaluation of current dense retrieval paradigms, urging the exploration of strategies for improved cross-lingual generalizability. The hybrid results elucidate potential pathways for advancing retrieval models by leveraging the synergy between sparse and dense representations.
Looking forward, the dataset presented in this paper provides a foundational basis for probing the impact of dense retrieval architectures and their training methodologies. With emerging interest in multi-step and cross-lingual fine-tuning strategies, there is fertile ground for further investigation into optimizing dense retrieval models' transfer capabilities across diverse linguistic landscapes. This also opens the door to refined evaluation methodologies and stronger retrieval systems, ultimately contributing to more inclusive global access to information.
In conclusion, this paper makes a significant contribution toward understanding and improving dense retrieval techniques for non-English languages, offering a critical resource to benchmark and refine retrieval systems on a global scale.