jina-embeddings-v3: Multilingual Embeddings With Task LoRA (2409.10173v3)

Published 16 Sep 2024 in cs.CL, cs.AI, and cs.IR

Abstract: We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks. With a default output dimension of 1024, users can flexibly reduce the embedding dimensions to as low as 32 without compromising performance, enabled by Matryoshka Representation Learning.

Overview of "Jina Embeddings V3: Multilingual Embeddings With Task LoRA"

The paper "Jina Embeddings V3: Multilingual Embeddings With Task LoRA" presents a novel text embedding model featuring 570 million parameters, which is designed for multilingual and long-context retrieval tasks. This model, named https://huggingface.co/jinaai/jina-embeddings-v3, achieves state-of-the-art performance across multiple domains, incorporating various advanced techniques such as Low-Rank Adaptation (LoRA) adapters and Matryoshka Representation Learning (MRL).

The paper systematically addresses common limitations of traditional embedding models, such as the need for per-task fine-tuning and the difficulty of deploying large models efficiently in production. The architecture and training methods demonstrated in the paper underscore the model's practical utility in scenarios that demand cost-efficient yet high-performing solutions, particularly for embedding tasks involving multilingual data.
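As a rough illustration of how such a model is typically consumed, the sketch below loads the checkpoint through the Hugging Face transformers library and selects a task adapter and a truncated output dimension at encoding time. The encode(..., task=..., truncate_dim=...) interface shown here is an assumption based on the custom-code loading path common to such models and may not match the released API exactly.

```python
# Minimal usage sketch (assumes the checkpoint exposes an `encode` helper
# with `task` and `truncate_dim` arguments via trust_remote_code; verify
# against the model card before relying on this interface).
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)

queries = ["How do I reset my password?"]
documents = ["To reset your password, open Settings and choose 'Security'."]

# The task name selects the corresponding LoRA adapter; truncate_dim relies
# on Matryoshka Representation Learning to shrink the 1024-d output.
q_emb = model.encode(queries, task="retrieval.query", truncate_dim=256)
d_emb = model.encode(documents, task="retrieval.passage", truncate_dim=256)
```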

Key Contributions

1. Task-Specific Optimization Using LoRA:

The model integrates task-specific LoRA adapters that significantly enhance performance on query-document retrieval, clustering, classification, and text matching. Empirical results show that these adapters outperform previous instruction-based approaches, producing higher-quality embeddings tailored to each task.
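To make the adapter mechanism concrete, the snippet below is a generic LoRA-style linear layer: a frozen base weight plus a trainable low-rank update. The rank, scaling, and initialization are illustrative defaults, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B (A x). Only A and B receive gradients."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)      # backbone weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * ((x @ self.lora_A.T) @ self.lora_B.T)
```

Switching tasks then amounts to swapping in a different pair of small A/B matrices while the backbone stays shared.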

2. Addressing Retrieval Failures with Synthetic Data:

To enhance robustness, the model incorporates synthetic training data targeting common retrieval failures. This approach mitigates typical issues, such as misleading syntactic similarities and misinterpretation of named entities, thereby improving the model’s reliability in edge cases.

3. Integration of Advanced Techniques:

The model employs Matryoshka Representation Learning (MRL), which allows flexible truncation of embedding dimensions without degrading performance. This flexibility is crucial for applications with varying space and performance requirements. Additionally, FlashAttention 2 and Rotary Position Embeddings (RoPE) are integrated, enabling the model to process sequences of up to 8192 tokens effectively.
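In practice, MRL-style truncation is just slicing the leading components of the embedding and re-normalizing; the helper below sketches that step (the dimension values are illustrative).

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of an MRL-trained embedding and
    re-normalize so that cosine similarity remains meaningful."""
    truncated = emb[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / np.clip(norm, 1e-12, None)

full = np.random.randn(2, 1024)       # stand-in for 1024-d model output
small = truncate_embedding(full, 64)  # 64-d vectors, e.g. for a cheaper ANN index
```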

Model Architecture

The backbone of jina-embeddings-v3 is adapted from XLM-RoBERTa, augmented to handle long text sequences, optimize task-specific embeddings, and improve efficiency. It includes several notable architectural modifications:

  • Long-Text Encoding: Uses Rotary Position Embeddings (RoPE) with an adjustable base frequency to better handle long contexts, maintaining performance for sequences up to 8192 tokens (see the sketch after this list).
  • Task-Specific LoRA Adapters: Embedding and linear layers within the attention mechanism are equipped with low-rank decomposition matrices, facilitating dynamic task-specific adaptation.
  • Efficiency and Scalability: Implements FlashAttention 2 and DeepSpeed framework to improve computational efficiency and scalability.
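The first point above, adjusting the RoPE base frequency, amounts to changing the base constant in the rotary frequency computation. The sketch below shows that computation in isolation; the specific base values are illustrative rather than the ones used in training.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Rotation angles for Rotary Position Embeddings.
    Raising `base` slows the rotation of low-frequency dimensions,
    which helps the model generalize to longer contexts."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)

short_ctx = rope_angles(512, 64, base=10_000.0)
long_ctx = rope_angles(8192, 64, base=1_000_000.0)  # larger base for 8k-token inputs
```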

Training Methodology

The training consists of three major stages:

  1. Pre-Training: The model undergoes standard masked language modeling (MLM) training on a multilingual corpus to establish foundational language understanding.
  2. Fine-Tuning for Embedding Tasks: The model is fine-tuned on text pairs with a bi-directional InfoNCE loss to learn to encode text sequences into single vector representations (a sketch of this loss follows the list).
  3. Training Task-Specific Adapters: Separate LoRA adapters are trained for distinct tasks (retrieval, text matching, classification, and separation), using specialized datasets and loss functions to optimize performance for each specific task.
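As a reference for stage 2, the function below sketches a bi-directional (symmetric) InfoNCE loss over a batch of paired embeddings: each side is contrasted against all items on the other side, in both directions. The temperature value is an illustrative choice, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over paired embeddings.
    a, b: (batch, dim) tensors where (a[i], b[i]) are positive pairs;
    all other in-batch combinations act as negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = (a @ b.T) / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_ab = F.cross_entropy(logits, targets)       # a -> b direction
    loss_ba = F.cross_entropy(logits.T, targets)     # b -> a direction
    return 0.5 * (loss_ab + loss_ba)
```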

Evaluation and Results

Multilingual and English MTEB Performance

The evaluation on multilingual and English benchmarks, particularly the MTEB (Massive Text Embedding Benchmark), underscores the model’s robustness:

  • English Tasks: Outperforms proprietary embedding models from OpenAI and Cohere on English MTEB tasks, reaching 82.58% average classification accuracy and 85.8% on sentence similarity.
  • Multilingual Tasks: Surpasses multilingual-e5-large-instruct on most multilingual benchmarks, with a weighted average score of 64.44% across languages.

LongEmbed Task Performance

On the LongEmbed long-document retrieval tasks, jina-embeddings-v3 achieves the highest average score, ahead of models such as text-embedding-3-large and bge-m3. This demonstrates its effective handling of long-context documents, facilitated by encoding techniques such as RoPE with an adjusted base frequency.

Addressing Retrieval Failures

The incorporation of synthetic data to address specific failure cases results in marked improvements across evaluated scenarios, as evidenced by the quantitative data in the evaluation studies. This targeted synthetic training helps mitigate issues that lead to lower retrieval precision and enhances overall robustness.

Future Implications and Speculations

jina-embeddings-v3 showcases a significant advancement in text embeddings for multilingual and contextually rich tasks. Its efficient architecture and advanced methodologies offer practical solutions for real-world applications, particularly in scenarios requiring scalable and cost-effective deployments.

Future developments in AI could further build on this foundation by exploring enhanced adapters for more granular task distinctions, optimizing embedding space utilization through advanced dimensionality reduction techniques, and incorporating continual learning mechanisms to keep the model updated with evolving language patterns and domain-specific knowledge.

Conclusion

The paper delivers a comprehensive solution to the common limitations of embedding models, illustrating the effectiveness of jina-embeddings-v3 with compelling empirical evidence. The innovations in task-specific optimizations and advanced embedding techniques position this model as a valuable asset for a wide range of NLP applications, setting a high standard for future research in multilingual embeddings and neural information retrieval.

Authors (11)
  1. Saba Sturua
  2. Isabelle Mohr
  3. Mohammad Kalim Akram
  4. Michael Günther
  5. Bo Wang
  6. Markus Krimmel
  7. Feng Wang
  8. Georgios Mastrapas
  9. Andreas Koukounas
  10. Nan Wang
  11. Han Xiao