
mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset (2108.13897v5)

Published 31 Aug 2021 in cs.CL and cs.AI

Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translation. We evaluated mMARCO by finetuning monolingual and multilingual reranking models, as well as a multilingual dense retrieval model on this dataset. We also evaluated models finetuned using the mMARCO dataset in a zero-shot scenario on Mr. TyDi dataset, demonstrating that multilingual models finetuned on our translated dataset achieve superior effectiveness to models finetuned on the original English version alone. Our experiments also show that a distilled multilingual reranker is competitive with non-distilled models while having 5.4 times fewer parameters. Lastly, we show a positive correlation between translation quality and retrieval effectiveness, providing evidence that improvements in translation methods might lead to improvements in multilingual information retrieval. The translated datasets and finetuned models are available at https://github.com/unicamp-dl/mMARCO.

mMARCO: A Multilingual Extension of the MS MARCO Passage Ranking Dataset

The paper "mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset" describes the creation and evaluation of a multilingual Information Retrieval (IR) dataset that extends the widely used MS MARCO dataset from English to 13 languages. The authors address a critical gap in IR resources, using machine translation to make this kind of training data available to non-English-speaking communities. They argue that mMARCO can serve as a foundation for training state-of-the-art multilingual and monolingual IR models and for investigating novel architectures across diverse languages.

Dataset Translation and Evaluation

To generate mMARCO, the authors used two translation methods: open-source neural machine translation models from the University of Helsinki (Opus-MT) and a commercial system, Google Translate. The 13 target languages were chosen considering factors such as speaker population and the availability of language resources. The translated dataset includes 532,761 query-passage training pairs per language.
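
As an illustration of this translation step, the sketch below runs a Hugging Face Opus-MT checkpoint over English passages; the exact checkpoints, batching, and decoding settings used by the authors may differ.

```python
# Minimal sketch of the passage-translation step, assuming the open-source
# Helsinki-NLP Opus-MT checkpoints on the Hugging Face Hub. The checkpoint,
# batch handling, and length limit here are illustrative choices.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # one of the target languages
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts, max_length=512):
    """Translate a batch of English passages into the target language."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=max_length)
    outputs = model.generate(**batch, max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate(["The capital of Brazil is Brasilia."]))
```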

The evaluation of mMARCO involved fine-tuning monolingual and multilingual reranking models, as well as a multilingual dense retriever, and benchmarking them in zero-shot scenarios on the Mr. TyDi dataset. Models fine-tuned on mMARCO were compared against the same models fine-tuned only on the original English data, with results indicating superior effectiveness in multiple languages. Notably, the multilingual models demonstrated robust zero-shot transfer across languages when evaluated on Mr. TyDi.
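
The sketch below illustrates reranker fine-tuning on mMARCO-style data using the generic sentence-transformers CrossEncoder recipe with a Multilingual MiniLM backbone; this is not the authors' exact training code, and the two toy pairs stand in for real mMARCO triples.

```python
# Sketch of fine-tuning a multilingual cross-encoder reranker on
# (query, passage, label) pairs. Hyperparameters are illustrative.
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Toy (query, positive passage) and (query, negative passage) pairs.
train_samples = [
    InputExample(texts=["qual é a capital do Brasil",
                        "Brasília é a capital do Brasil desde 1960."], label=1.0),
    InputExample(texts=["qual é a capital do Brasil",
                        "O rio Amazonas é o mais extenso da América do Sul."], label=0.0),
]

# Multilingual MiniLM backbone; num_labels=1 gives a relevance-scoring head.
model = CrossEncoder("microsoft/Multilingual-MiniLM-L12-H384", num_labels=1)
loader = DataLoader(train_samples, shuffle=True, batch_size=2)
model.fit(train_dataloader=loader, epochs=1, warmup_steps=10)

# At inference time, the reranker scores each (query, candidate) pair.
scores = model.predict([("qual é a capital do Brasil",
                         "Brasília é a capital do Brasil desde 1960.")])
```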

Experimental Findings

The findings show that multilingual models fine-tuned on mMARCO outperform models trained solely on the English dataset, reinforcing the value of multilingual training data for IR systems. Moreover, mMiniLM, a distilled multilingual reranker with 5.4 times fewer parameters, performed competitively against larger non-distilled models, suggesting that efficiency need not come at the cost of accuracy.
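
For reference, the snippet below sketches how such a distilled reranker could be applied at query time; the checkpoint name is an assumption based on the models released alongside the paper and should be verified against the project repository.

```python
# Reranking candidates with a distilled multilingual MiniLM cross-encoder.
# The checkpoint name below is an assumption; the released models are
# listed in the https://github.com/unicamp-dl/mMARCO repository.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("unicamp-dl/mMiniLM-L6-v2-mmarco-v2")  # assumed name

query = "¿cuál es la capital de Brasil?"
candidates = [
    "Brasilia es la capital de Brasil desde 1960.",
    "El Amazonas es el río más caudaloso del mundo.",
]
scores = reranker.predict([(query, passage) for passage in candidates])

# Sort candidates from most to least relevant.
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in ranked:
    print(f"{score:.3f}  {passage}")
```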

A correlation analysis showed that higher translation quality yields modestly better retrieval effectiveness, implying that advances in translation models could further improve multilingual IR. Even with current translation quality, however, the results underline the utility of mMARCO in fostering more inclusive IR systems.
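
The correlation itself is straightforward to reproduce given per-language translation and retrieval scores, as in the sketch below; the values shown are placeholders, not the paper's numbers.

```python
# Computing the translation-quality vs. retrieval-effectiveness correlation.
# The BLEU and MRR@10 values below are placeholders for illustration only.
from scipy.stats import pearsonr

bleu = [30.2, 35.1, 41.7, 44.0, 48.3]       # translation quality per language
mrr_at_10 = [0.24, 0.26, 0.28, 0.29, 0.30]  # retrieval effectiveness per language

r, p_value = pearsonr(bleu, mrr_at_10)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```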

Implications and Future Directions

Practically, mMARCO is a significant contribution to the IR community: a diverse, accessible dataset for training models that accommodate multilingual contexts. Theoretically, it raises questions about which model architectures can best handle machine-translated training data and transfer to zero-shot settings.

The extensive evaluation and the dataset itself hold promise for future work in multilingual IR, providing a platform for developing more inclusive and widely applicable retrieval systems. Future work could extend the set of languages covered by mMARCO or develop translation models tailored specifically to IR tasks.

In conclusion, mMARCO sets an important precedent in multilingual dataset creation and use, offering a significant stepping stone toward more accessible and effective IR systems across languages. Its contribution extends beyond merely translating MS MARCO, laying the groundwork for IR systems that operate effectively across linguistic barriers.

Authors (7)
  1. Luiz Bonifacio
  2. Vitor Jeronymo
  3. Hugo Queiroz Abonizio
  4. Israel Campiotti
  5. Marzieh Fadaee
  6. Roberto Lotufo
  7. Rodrigo Nogueira
Citations (96)