Nomic Embed: Training a Reproducible Long Context Text Embedder

Published 2 Feb 2024 in cs.CL and cs.AI | (2402.01613v2)

Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

Summary

  • The paper presents nomic-embed-text-v1, a novel, reproducible text embedding model extending context to 8192 tokens while outperforming contemporary models.
  • It employs a multi-stage training process including MLM, unsupervised contrastive pretraining, and supervised fine-tuning with datasets like BooksCorpus, Wikipedia, and MSMarco.
  • Architectural innovations such as rotary embeddings, Flash Attention, and SwiGLU enable a compact 137M parameter design to achieve state-of-the-art performance.

Introduction

The paper presents nomic-embed-text-v1, a text embedding model with an extended sequence length of 8192 tokens that surpasses contemporary OpenAI embedding models, text-embedding-ada-002 and text-embedding-3-small, on tasks involving both short and long contexts. What distinguishes the model is its fully open, reproducible release: the model weights, training data, and code are all shared, enabling end-to-end auditability. The model suits a wide range of NLP applications, especially those where processing extended context is critical.
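
Because the weights, code, and data are openly released, the model can be exercised directly. The snippet below is a minimal usage sketch, not taken from the report: it assumes the weights are published on the Hugging Face Hub as nomic-ai/nomic-embed-text-v1 and that retrieval inputs use the "search_document:"/"search_query:" task prefixes from the released model card; both details are assumptions rather than statements made in this summary.

```python
# Minimal usage sketch (assumptions: weights available on the Hugging Face Hub
# as "nomic-ai/nomic-embed-text-v1"; retrieval inputs use the task prefixes
# "search_document: " / "search_query: " from the released model card).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = ["search_document: Nomic Embed supports sequences of up to 8192 tokens."]
query = ["search_query: What is the maximum context length of nomic-embed-text-v1?"]

doc_emb = model.encode(docs, normalize_embeddings=True)      # (1, 768) array
query_emb = model.encode(query, normalize_embeddings=True)   # (1, 768) array

# With normalized embeddings, cosine similarity reduces to a dot product.
print((query_emb @ doc_emb.T).item())
```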

Prior to this work, most state-of-the-art text embedding models were limited to a maximum context length of 512 tokens, and long-context capability was largely restricted to closed-source models. Notable long-context open models such as E5-Mistral-7b-instruct carry significant computational resource requirements. In contrast, nomic-embed-text-v1 not only extends the context length substantially but also remains computationally feasible thanks to its optimized architecture, enabling deployment across a wider range of applications without the burden of large parameter counts.

Training Methodology

The model's training comprises three stages: pretraining with masked language modeling (MLM), unsupervised contrastive pretraining, and supervised contrastive fine-tuning. In the MLM phase, data from BooksCorpus and Wikipedia are used to train a long-context adaptation of BERT. Unsupervised contrastive pretraining then follows on a dataset curated to roughly 235 million text pairs, which teaches the model diverse semantic representations. Finally, supervised contrastive fine-tuning on datasets such as MSMarco and NQ further sharpens performance on downstream tasks.
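
Both contrastive stages are typically trained with an InfoNCE-style objective over paired texts, using the other examples in a batch as negatives. The sketch below illustrates that objective; the temperature value and the use of purely in-batch negatives are simplifying assumptions for illustration, not hyperparameters quoted from the report.

```python
# Illustrative InfoNCE loss with in-batch negatives, the standard objective for
# contrastive embedding training. Temperature and batch construction here are
# simplifying assumptions, not the report's exact settings.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim); row i of each tensor forms a positive pair."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Random tensors stand in for encoder outputs of a query/document pair batch.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```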

Model Architecture

The model adopts several modifications to support long context lengths. Rotary positional embeddings replace absolute positional embeddings, encoding relative positions in a way that scales more gracefully to extended sequences. Flash Attention reduces the memory and compute cost of the attention mechanism, allowing longer sequences to be processed efficiently. Additionally, the SwiGLU activation function is used in the feed-forward blocks, and dynamic NTK interpolation extends the usable context window to 8192 tokens at inference time. These architectural choices yield a compact 137M parameter model that effectively processes extended contexts, marking a notable advancement in the design of embedding models.
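
As a concrete illustration of two of these components, the sketch below shows a SwiGLU feed-forward block and rotary position embeddings applied to query/key vectors. The layer sizes and rotary base are illustrative rather than the model's exact hyperparameters, and dynamic NTK interpolation, which amounts to rescaling the rotary base at inference to cover a longer window, is only noted in a comment.

```python
# Sketches of a SwiGLU feed-forward block and rotary position embeddings (RoPE).
# Sizes and the rotary base are illustrative, not nomic-embed-text-v1's values.
# Dynamic NTK interpolation rescales `base` at inference so the rotation
# frequencies cover a longer (e.g. 8192-token) window; it is omitted here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (batch, seq_len, dim) by position-dependent angles."""
    _, seq_len, dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()      # each (seq_len, dim/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = apply_rotary(torch.randn(2, 16, 64))               # queries with positions encoded
y = SwiGLU(dim=768, hidden=3072)(torch.randn(2, 16, 768))
```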

Experimental Evaluation

nomic-embed-text-v1 performs strongly across several benchmarks, including MTEB, LoCo, and Jina's Long Context Evaluation. It outperforms text-embedding-ada-002 on both short- and long-context evaluations, demonstrating robustness and versatility. The benchmark results show gains on information retrieval, clustering, and reranking tasks, and the long-context results in particular confirm its suitability for applications that require expansive contextual comprehension.
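
MTEB scores of this kind are commonly produced with the open-source mteb evaluation harness, which wraps any model exposing an encode interface. The snippet below is a sketch of such a run; it assumes the mteb package's MTEB(tasks=...)/run(...) interface and a SentenceTransformer-compatible checkpoint as in the earlier example, and the single task chosen is arbitrary rather than the full suite reported in the paper.

```python
# Sketch of an MTEB-style evaluation run (assumes the open-source `mteb`
# package and a SentenceTransformer-compatible model; one retrieval task is
# chosen as an example, not the full MTEB suite).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/nomic-embed-text-v1")
```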

Future Directions and Implications

The open-source and reproducible nature of nomic-embed-text-v1 marks a shift toward model transparency and reliability in NLP. It lays a foundation for future work on embedding models and facilitates auditing and compliance, particularly in high-stakes industry deployments. Future research may explore further scaling of such models while improving computational efficiency and investigating additional applications that benefit from extended context processing.

Conclusion

Nomic-embed-text-v1 represents a significant contribution to the domain of text embeddings. By addressing both the limitations of context length and the resource-intensive computational requirements of larger models, it sets a new benchmark for open, reproducible models in NLP, promoting broader accessibility and adaptability across diverse machine learning applications. The release of open weights, code, and datasets empowers researchers and practitioners aiming to replicate and extend its capabilities.
