
Arctic-Embed 2.0: Multilingual Retrieval Without Compromise (2412.04506v2)

Published 3 Dec 2024 in cs.CL, cs.IR, and cs.LG

Abstract: This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly lower compressed quality degradation compared to alternatives. We detail the design and implementation, presenting several important open research questions that arose during model development. We conduct experiments exploring these research questions and include extensive discussion aimed at fostering further research in this field.

Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

The paper "Arctic-Embed 2.0: Multilingual Retrieval Without Compromise" introduces Arctic-Embed 2.0, a suite of text embedding models optimized for multilingual retrieval tasks. The work addresses a recurring limitation of prior multilingual models: English retrieval performance tends to degrade when a model's capabilities are extended to multilingual contexts.

The primary contribution of Arctic-Embed 2.0 is a training methodology applied to two model sizes, balancing efficiency and effectiveness. These models deliver competitive retrieval quality on both multilingual and English-only benchmarks, such as MTEB Retrieval and CLEF, addressing trade-offs common in previous models. The integration of Matryoshka Representation Learning (MRL) significantly reduces quality degradation under embedding compression, offering a stronger approach to dimensionality reduction for multilingual embeddings.
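To make the MRL compression idea concrete, the sketch below shows how an MRL-trained embedding is typically used at inference time: because training concentrates the most informative components in the leading dimensions, the vector can be truncated to a prefix and re-normalized before similarity search. The function name and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def compress_embedding(embedding: np.ndarray, target_dim: int = 256) -> np.ndarray:
    """Truncate an MRL-trained embedding to its leading dimensions and re-normalize.

    MRL training packs information into the first components, so the truncated
    prefix remains usable for cosine-similarity retrieval.
    """
    truncated = embedding[:target_dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Example: a full 768-dim embedding compressed to 256 dims (3x storage savings).
full = np.random.randn(768).astype(np.float32)
full /= np.linalg.norm(full)
small = compress_embedding(full, target_dim=256)
print(small.shape)  # (256,)
```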

Key Findings and Methodology

The research identifies two primary limitations in previous multilingual models: compromised English retrieval quality and inefficiency due to large model sizes. Arctic-Embed 2.0 mitigates these issues with a three-stage training framework: pretraining, contrastive pretraining, and contrastive finetuning. Notably, the paper questions the assumption that pretraining on distant languages degrades English retrieval performance, contesting this hypothesis with empirical evidence and suggesting alternative explanations.
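As background for the contrastive stages, the sketch below shows the standard in-batch-negative InfoNCE objective used to train retrieval embeddings. It is a generic formulation, not the paper's exact loss or hyperparameters; the temperature value and batch size are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """In-batch-negative contrastive (InfoNCE) loss for retrieval embeddings.

    query_emb, doc_emb: (batch, dim) L2-normalized embeddings where row i of
    doc_emb is the positive document for row i of query_emb; all other rows
    in the batch act as negatives.
    """
    logits = query_emb @ doc_emb.T / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random normalized embeddings.
q = F.normalize(torch.randn(8, 768), dim=-1)
d = F.normalize(torch.randn(8, 768), dim=-1)
loss = info_nce_loss(q, d)
```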

The training process relies on careful data selection and filtering, including large-scale contrastive pretraining on diverse multilingual datasets such as mC4 and CC News. The use of MRL enables aggressive embedding compression while maintaining retrieval performance even when embeddings are truncated to lower dimensions.
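During training, MRL is commonly implemented by applying the same contrastive loss at several nested prefix dimensions, so that truncated embeddings remain retrieval-quality. The sketch below illustrates this idea; the specific nested dimensions and equal weighting are assumptions, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def mrl_contrastive_loss(query_emb: torch.Tensor,
                         doc_emb: torch.Tensor,
                         nested_dims=(256, 768),
                         temperature: float = 0.02) -> torch.Tensor:
    """Matryoshka-style objective: average the contrastive loss over several
    nested prefix dimensions so truncated embeddings stay useful."""
    total = 0.0
    for dim in nested_dims:
        q = F.normalize(query_emb[:, :dim], dim=-1)   # truncate, then re-normalize
        d = F.normalize(doc_emb[:, :dim], dim=-1)
        logits = q @ d.T / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        total = total + F.cross_entropy(logits, targets)
    return total / len(nested_dims)
```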

Experimental Results

Comprehensive evaluations were conducted on established retrieval benchmarks: MTEB Retrieval, CLEF, and MIRACL. The two models, Arctic-Embed 2.0-M and Arctic-Embed 2.0-L, achieved top-tier performance, surpassing leading alternatives while losing little retrieval quality under embedding compression. The medium-sized model outperformed Google's text-embedding-004 on MTEB-R, a striking result given the difference in model size.
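Retrieval quality on benchmarks such as these is conventionally reported as nDCG@10. For readers unfamiliar with the metric, the sketch below computes one common variant (linear gain, log2 discount); the example relevance labels are illustrative only.

```python
import math

def ndcg_at_k(ranked_relevances, k: int = 10) -> float:
    """nDCG@k: discounted cumulative gain of the top-k ranking, normalized by
    the ideal (sorted) ranking. ranked_relevances[i] is the graded relevance
    of the document the system placed at rank i+1."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: a ranking with graded relevance labels for the top five results.
print(round(ndcg_at_k([3, 2, 0, 1, 0], k=10), 3))
```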

Implications and Future Directions

The paper raises open research questions and identifies room for further investigation, particularly concerning negative cross-lingual transfer effects observed in specific language families and the influence of pretraining data on English retrieval performance. These findings underline the intricacy of cross-lingual transfer and challenge prevailing assumptions about the effects of multilingual pretraining.

Practically, Arctic-Embed 2.0 offers a viable solution for enterprises and research settings where multilingual retrieval is crucial but often constrained by quality and efficiency trade-offs. The models perform strongly in scenarios that require a reduced memory footprint without sacrificing retrieval quality, making them well suited to deployment in resource-constrained environments.
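For deployment, the released checkpoints can typically be loaded through the sentence-transformers library. The sketch below is a minimal usage example under stated assumptions: the Hugging Face identifier "Snowflake/snowflake-arctic-embed-m-v2.0", the need for trust_remote_code, and the availability of a "query" prompt are assumptions about the public release, not details from the paper.

```python
from sentence_transformers import SentenceTransformer

# Assumed public release name and config; adjust to the checkpoint you actually use.
model = SentenceTransformer(
    "Snowflake/snowflake-arctic-embed-m-v2.0",
    truncate_dim=256,        # MRL: keep only the leading 256 dimensions
    trust_remote_code=True,  # may be required by the released backbone
)

queries = ["¿Cuál es la capital de Francia?"]
documents = ["Paris is the capital and largest city of France."]

# prompt_name="query" applies a query-side prompt, assuming the released
# model configuration defines one.
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(documents, normalize_embeddings=True)
print(q_emb @ d_emb.T)  # cosine similarity scores
```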

The results and methodology set a foundation for future developments in scalable multilingual retrieval systems. Further refinement of hard negative mining and contrastive pretraining could yield even more efficient and effective models. Expanding the models' linguistic coverage while ensuring quality data integration also remains a promising area for exploration.

In summary, "Arctic-Embed 2.0: Multilingual Retrieval Without Compromise" advances multilingual information retrieval by addressing prominent limitations with a careful, well-documented methodology. The work stands to inform both theoretical understanding and practical applications, paving the way for more capable multilingual embedding models.

Authors
  1. Puxuan Yu
  2. Luke Merrick
  3. Gaurav Nuti
  4. Daniel Campos