ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval (2402.15059v1)
Abstract: State-of-the-art neural retrievers predominantly focus on high-resource languages like English, which impedes their adoption in retrieval scenarios involving other languages. Current approaches circumvent the lack of high-quality labeled data in non-English languages by leveraging multilingual pretrained language models capable of cross-lingual transfer. However, these models require substantial task-specific fine-tuning across multiple languages, often perform poorly in languages with minimal representation in the pretraining corpus, and struggle to incorporate new languages after the pretraining phase. In this work, we present a novel modular dense retrieval model that learns from the rich data of a single high-resource language and transfers zero-shot to a wide array of languages, thereby eliminating the need for language-specific labeled data. Our model, ColBERT-XM, demonstrates competitive performance against existing state-of-the-art multilingual retrievers trained on more extensive datasets in various languages. Further analysis reveals that our modular approach is highly data-efficient, adapts effectively to out-of-distribution data, and significantly reduces energy consumption and carbon emissions. By demonstrating its proficiency in zero-shot scenarios, ColBERT-XM marks a shift towards more sustainable and inclusive retrieval systems, enabling effective information accessibility in numerous languages. We publicly release our code and models for the community.
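As its name indicates, ColBERT-XM builds on ColBERT-style multi-vector retrieval, where queries and documents are encoded as bags of token embeddings and scored by late interaction ("MaxSim"): each query token is matched against its most similar document token, and those maxima are summed. A minimal NumPy sketch of that scoring (function name, shapes, and dimensions are illustrative, not the paper's implementation):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (n_query_tokens, dim) token embeddings
    doc_emb:   (n_doc_tokens, dim) token embeddings
    For each query token, take the max cosine similarity over all
    document tokens, then sum across query tokens.
    """
    # L2-normalize rows so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy check with random "token embeddings" (dim 8 is arbitrary)
rng = np.random.default_rng(0)
query = rng.standard_normal((4, 8))            # 4 query tokens
doc_a = rng.standard_normal((10, 8))           # unrelated document
doc_b = np.vstack([query, rng.standard_normal((6, 8))])  # contains the query tokens

# A document containing the query's exact token embeddings scores the
# maximum possible n_query_tokens (cosine 1 per query token).
print(maxsim_score(query, doc_a) < maxsim_score(query, doc_b))  # True
```

The modular aspect of ColBERT-XM lies elsewhere, in language-specific adapter modules of the XMOD-style backbone; the scoring itself stays language-agnostic, which is what lets a model trained on a single high-resource language score query-document pairs in other languages unchanged.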