Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models (2405.05374v1)

Published 8 May 2024 in cs.CL, cs.AI, and cs.IR

Abstract: This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of their size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embed-3-large. In addition to the details of our training recipe, we have provided several informative ablation studies, which we believe are the cause of our model performance.

References (37)
  1. MS MARCO: A human generated machine reading comprehension dataset. ArXiv, abs/1611.09268.
  2. DisCo-CLIP: A distributed contrastive loss for memory efficient CLIP training.
  3. Together Computer. 2023. RedPajama: An open dataset for training large language models.
  4. Promptagator: Few-shot dense retrieval from 8 examples.
  5. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  6. BERT: Pre-training of deep bidirectional transformers for language understanding.
  7. The Faiss library.
  8. Jina Embeddings 2: 8192-token general-purpose text embeddings for long documents.
  9. Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2:1735–1742.
  10. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
  11. Gecko: Versatile text embeddings distilled from large language models.
  12. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates Inc.
  13. Xianming Li and Jing Li. 2023. AnglE-optimized text embeddings.
  14. Towards general text embeddings with multi-stage contrastive learning. ArXiv, abs/2308.03281.
  15. Towards general text embeddings with multi-stage contrastive learning.
  16. Pretrained transformers for text ranking: BERT and beyond. Proceedings of the 14th ACM International Conference on Web Search and Data Mining.
  17. Jonas W. Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In AAAI Conference on Artificial Intelligence.
  18. Generative representational instruction tuning.
  19. MTEB: Massive text embedding benchmark.
  20. Nomic Embed: Training a reproducible long context text embedder.
  21. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only.
  22. RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering.
  23. Scaling language models: Methods, analysis & insights from training Gopher.
  24. Exploring the limits of transfer learning with a unified text-to-text transformer.
  25. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331.
  26. Benchmarking and building long-context retrieval models with LoCo and M2-BERT.
  27. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  28. Representation learning with contrastive predictive coding. ArXiv, abs/1807.03748.
  29. Attention is all you need. ArXiv, abs/1706.03762.
  30. Text embeddings by weakly-supervised contrastive pre-training. ArXiv, abs/2212.03533.
  31. Text embeddings by weakly-supervised contrastive pre-training.
  32. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers.
  33. C-Pack: Packaged resources to advance general Chinese embedding.
  34. Approximate nearest neighbor negative contrastive learning for dense text retrieval. ArXiv, abs/2007.00808.
  35. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing.
  36. Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. 2024. SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog.
  37. RankT5: Fine-tuning T5 for text ranking with ranking losses.

Summary

  • The paper demonstrates Arctic-embed models achieving superior retrieval accuracy and outperforming closed-source systems on benchmark datasets.
  • The paper details a stratified training strategy that leverages diverse data sources and optimized pretraining techniques to enhance performance.
  • The paper presents a family of model sizes ranging from 22 to 334 million parameters, addressing varied computational constraints.

Exploring the Arctic-embed Models: Modern Advances in Text Embedding

Introduction to Arctic-embed Models

The landscape of text embedding models is rapidly evolving, and the recent introduction of the Arctic-embed models offers a vivid snapshot of this dynamic field. The Arctic-embed family comprises five text embedding models that share a common training methodology but differ in scale, ranging from 22 to 334 million parameters. What sets these models apart in a crowded field is retrieval accuracy that outperforms many closed-source competitors on standardized benchmarks such as the MTEB Retrieval leaderboard.

Training and Model Specifications

Model Sizes and Architecture

Arctic-embed models are all encoder-only architectures, akin to BERT, released in a range of sizes: from the smallest 'xs' model at roughly 22 million parameters to the largest 'l' model at 334 million parameters. The detailed breakdown includes:

  • xs: Built on a MiniLMv2 backbone; the lightest and fastest variant, suited to resource-constrained environments.
  • s and m: Middle-tier models that balance compute efficiency and retrieval quality, making them versatile for many practical applications.
  • m-long: A variant of the m model that accepts longer input sequences, targeting long-document retrieval.
  • l: The largest model, aimed at demanding retrieval tasks where additional computational overhead is acceptable in exchange for higher accuracy.

Each model variant achieved state-of-the-art retrieval accuracy for its size class at the time of release, making the Arctic-embed suite appealing for both academic research and practical applications.
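
To give a concrete sense of how these checkpoints are used in practice, the sketch below encodes a query and a few documents with the medium model via the sentence-transformers library. The Snowflake/snowflake-arctic-embed-m checkpoint name and the query-side prefix follow the public model cards rather than this report, so treat them as assumptions to verify against your own environment.

```python
# Minimal retrieval sketch (assumptions: the public snowflake-arctic-embed-m
# checkpoint and the query prefix recommended on its model card).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

queries = ["how do text embedding models work"]
documents = [
    "Text embedding models map passages to dense vectors for retrieval.",
    "BERT is an encoder-only transformer pretrained with masked language modeling.",
]

# The model card suggests a query-side prefix for retrieval-style usage.
query_prefix = "Represent this sentence for searching relevant passages: "
query_emb = model.encode([query_prefix + q for q in queries], normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
scores = query_emb @ doc_emb.T
print(scores)
```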

Training Data and Techniques

The Arctic-embed models benefit from meticulous attention to the quality and diversity of their training data. Leveraging a blend of web search data, high-quality web data, and synthetic data, the training regimen exposes the models to a broad spectrum of language uses and contexts. One innovative aspect of the training is a "stratified" sampling approach in which each minibatch is drawn from a single data source. This method, along with other techniques such as longer sequence lengths during pretraining and fine-tuning tailored to retrieval, appears to contribute significantly to the models' success.
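
To make the stratification idea concrete, here is a minimal, hedged sketch of source-homogeneous batching: every minibatch is sampled from exactly one source, so in-batch negatives come from the same distribution as the positives. It illustrates the general technique rather than the authors' actual data loader, and the source names are invented.

```python
# Sketch of source-stratified batching: each yielded batch contains examples
# from a single source; source selection is weighted by remaining data.
import random

def stratified_batches(sources, batch_size, seed=0):
    """Yield (source_name, batch) pairs where each batch comes from one source.

    `sources` maps a source name (e.g. "web_search", "synthetic") to a list of examples.
    """
    rng = random.Random(seed)
    pools = {name: rng.sample(examples, len(examples)) for name, examples in sources.items()}
    while any(pools.values()):
        candidates = [name for name, pool in pools.items() if len(pool) >= batch_size]
        if not candidates:
            break  # drop ragged leftovers smaller than one batch
        name = rng.choices(candidates, weights=[len(pools[n]) for n in candidates])[0]
        batch, pools[name] = pools[name][:batch_size], pools[name][batch_size:]
        yield name, batch

# Toy example with invented source names and sizes:
toy_sources = {
    "web_search": [f"ws_{i}" for i in range(8)],
    "synthetic": [f"syn_{i}" for i in range(4)],
}
for source, batch in stratified_batches(toy_sources, batch_size=2):
    print(source, batch)
```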

Practical Implications and Theoretical Insights

Retrieval Performance

With text embedding models finding use in search systems and a range of NLP applications, the ability of these embeddings to retrieve relevant information accurately is paramount. Here the Arctic-embed models stand out, having demonstrated superior performance in benchmark evaluations. Especially notable is the family's range of sizes, which lets practitioners match model cost to the scale of their data, making it a strong candidate for scalable search solutions.
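
As a rough illustration of how such embeddings slot into a scalable search stack, the sketch below indexes normalized document vectors with FAISS (which appears in the report's references) and retrieves nearest neighbors by inner product. The 768-dimension width and the random placeholder vectors are assumptions for illustration only; in practice the vectors would come from an Arctic-embed encoder.

```python
# Sketch: exact inner-product search over normalized embeddings with FAISS.
import numpy as np
import faiss

dim = 768  # embedding width assumed for a base-sized encoder (illustrative)
doc_emb = np.random.rand(10_000, dim).astype("float32")  # placeholder for real document embeddings
faiss.normalize_L2(doc_emb)

index = faiss.IndexFlatIP(dim)  # exact search; swap in an HNSW or IVF index at larger scale
index.add(doc_emb)

query_emb = np.random.rand(1, dim).astype("float32")  # placeholder for a real query embedding
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, 5)  # top-5 documents by inner product
print(ids[0], scores[0])
```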

Future Prospects and Speculations

Given the open-source nature of these models, there is considerable potential for widespread adoption and community-driven enhancements. Tunable aspects of the training recipe, such as the hard-negative mining strategy and the use of synthetic data to improve sampling efficiency, are also areas ripe for further research. We may additionally see more specialized versions of these models, adapted or optimized for particular types of text or particular languages.
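
For readers unfamiliar with the term, the sketch below shows one generic form of hard-negative mining: score candidate passages with a current embedding model and keep the highest-scoring non-positives as training negatives. This is a common recipe in the retrieval literature, not necessarily the exact procedure used for Arctic-embed, and all names and data here are illustrative.

```python
# Generic hard-negative mining sketch: rank documents by similarity to the
# query and keep the best-scoring ones that are not labeled positives.
import numpy as np

def mine_hard_negatives(query_emb, doc_emb, positive_ids, num_negatives=5):
    """Return indices of the highest-scoring documents that are not positives."""
    scores = query_emb @ doc_emb.T          # cosine scores if inputs are normalized
    ranked = np.argsort(-scores)            # best-scoring documents first
    negatives = [int(i) for i in ranked if int(i) not in positive_ids]
    return negatives[:num_negatives]

# Toy example with random 3-dimensional embeddings:
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 3)).astype("float32")
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0] + 0.1 * rng.normal(size=3).astype("float32")
query /= np.linalg.norm(query)
print(mine_hard_negatives(query, docs, positive_ids={0}))
```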

Conclusions

The Arctic-embed models represent a significant step forward in the development of effective and scalable text embedding models. Through a combination of innovative data handling techniques and robust training strategies, these models achieve excellent performance metrics, suggesting their utility in a wide range of applications spanning from simple retrieval tasks to complex NLP workflows. As these models are further studied, adapted, and perhaps even improved upon by the open-source and AI research communities, we can expect them to solidify their place as essential tools in the text analysis arsenal.
