Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models (2405.05374v1)
Published 8 May 2024 in cs.CL, cs.AI, and cs.IR
Abstract: This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters, with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embedding-3-large. In addition to the details of our training recipe, we provide several informative ablation studies, which we believe explain our models' performance.
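Because the weights are open-sourced, a short retrieval-style usage sketch may help make the setup concrete. The snippet below is a minimal illustration using the sentence-transformers library; the checkpoint name Snowflake/snowflake-arctic-embed-l and the query prefix shown are assumptions based on common Hugging Face conventions, not details stated in this abstract, so consult the released model cards for the exact identifiers and prompts.

```python
# Minimal retrieval sketch with an arctic-embed model.
# Assumption: the checkpoint "Snowflake/snowflake-arctic-embed-l" is available on the
# Hugging Face Hub and works with the standard sentence-transformers encode() API.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Hypothetical query prefix; many retrieval embedders expect one (check the model card).
query = "Represent this sentence for searching relevant passages: what is dense retrieval?"
passages = [
    "Dense retrieval encodes queries and documents into vectors and ranks by similarity.",
    "BM25 is a classical sparse lexical ranking function.",
]

# Normalized embeddings so that the dot product equals cosine similarity.
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

scores = util.dot_score(query_emb, passage_embs)  # shape: (1, num_passages)
best = scores.argmax().item()
print(f"Top passage ({scores[0, best].item():.3f}): {passages[best]}")
```

This is a sketch of downstream use only; the report itself is about the data curation and training recipe that produced the models, not their inference API.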