Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation (2301.12566v1)

Published 29 Jan 2023 in cs.CL and cs.IR

Abstract: Benefiting from transformer-based pre-trained LLMs, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained LLMs provides great support for designing neural cross-lingual retrieval models. However, due to unbalanced pre-training data across languages, multilingual LLMs have already shown a performance gap between high- and low-resource languages in many downstream tasks. Cross-lingual retrieval models built on such pre-trained models can inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, where large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes it more challenging to train cross-lingual retrieval models. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high- to low-resource languages, OPTICAL formulates the cross-lingual token alignment task as an optimal transport problem and learns from a well-trained monolingual retrieval model. By separating cross-lingual knowledge from query-document matching knowledge, OPTICAL only needs bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including neural machine translation.
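
To make the core idea concrete, here is a minimal sketch of optimal-transport token-alignment distillation as described in the abstract: token embeddings from bitext pairs are aligned with an entropy-regularized (Sinkhorn) transport plan, and the low-resource student's tokens are pulled toward the teacher's aligned representations. The function names, cosine-distance cost, uniform marginals, and MSE loss are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only (assumed details, not OPTICAL's exact recipe):
# entropy-regularized OT over token embeddings from a bitext pair, used as a
# soft alignment for distilling a low-resource student toward an English teacher.
import torch
import torch.nn.functional as F


def sinkhorn(cost: torch.Tensor, reg: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """Entropy-regularized transport plan for an (n x m) cost matrix with uniform marginals."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)          # student-side marginal
    b = torch.full((m,), 1.0 / m)          # teacher-side marginal
    K = torch.exp(-cost / reg)             # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # (n x m) transport plan


def ot_distill_loss(student_tok: torch.Tensor, teacher_tok: torch.Tensor) -> torch.Tensor:
    """Align low-resource student tokens to teacher tokens via the OT plan, then distill."""
    s = F.normalize(student_tok, dim=-1)
    t = F.normalize(teacher_tok, dim=-1)
    cost = 1.0 - s @ t.t()                       # cosine-distance cost matrix
    with torch.no_grad():
        plan = sinkhorn(cost)                    # soft cross-lingual token alignment
    # Barycentric projection: each student position gets a teacher-embedding target.
    target = (plan / plan.sum(dim=1, keepdim=True)) @ teacher_tok
    return F.mse_loss(student_tok, target)


# Toy usage: 7 low-resource-language tokens vs. 9 English tokens, 128-dim embeddings.
loss = ot_distill_loss(torch.randn(7, 128), torch.randn(9, 128))
print(float(loss))
```

Because the alignment loss only needs parallel sentences (bitext) rather than labeled query-document pairs, this distillation step can be trained separately from the monolingual matching component, which is the separation of concerns the abstract emphasizes.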

Authors (3)
  1. Zhiqi Huang (78 papers)
  2. Puxuan Yu (7 papers)
  3. James Allan (28 papers)
Citations (20)