- The paper presents nomic-embed-text-v1, a reproducible text embedding model with an 8192-token context length that outperforms OpenAI's text-embedding-ada-002 and text-embedding-3-small on short- and long-context benchmarks.
- Training proceeds in stages: masked language modeling (MLM) on BooksCorpus and Wikipedia, unsupervised contrastive pretraining on a large curated set of text pairs, and supervised contrastive fine-tuning on datasets such as MSMarco.
- Architectural innovations such as rotary embeddings, Flash Attention, and SwiGLU enable a compact 137M parameter design to achieve state-of-the-art performance.
Nomic Embed: Training a Reproducible Long Context Text Embedder
Introduction
The paper presents nomic-embed-text-v1, a text embedding model with an extended sequence length of 8192 tokens that surpasses OpenAI's text-embedding-ada-002 and text-embedding-3-small on both short- and long-context tasks. Its distinguishing feature is a fully open, reproducible release: model weights, training data, and training code are all published, making the pipeline auditable end to end. This makes the model well suited to a broad range of NLP applications, particularly those where long-context processing is critical.
Before this work, most state-of-the-art text embedding models were limited to a maximum context length of 512 tokens, and long-context capability was largely confined to closed-source models. Long-context alternatives such as E5-Mistral-7b-instruct exist, but their size (7B parameters) imposes substantial computational costs. By contrast, nomic-embed-text-v1 raises the context length substantially while remaining computationally practical thanks to its compact, optimized architecture, allowing deployment across a wider range of applications without the burdens of large-parameter models.
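To make the open-release claim concrete, the following sketch loads the published checkpoint through the sentence-transformers library and embeds a query against two documents. The model id, the trust_remote_code flag, and the "search_query: " / "search_document: " task prefixes come from the public model card rather than the paper itself, so treat them as assumptions about the released artifact.

```python
# Minimal usage sketch (not from the paper): loading the openly released
# checkpoint via sentence-transformers. The model id, trust_remote_code flag,
# and task prefixes follow the public model card and are assumptions here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# The model card prescribes task prefixes so the model knows the use case.
docs = [
    "search_document: Nomic Embed is a long-context text embedding model.",
    "search_document: Flash Attention reduces the memory cost of attention.",
]
doc_emb = model.encode(docs)
query_emb = model.encode("search_query: what is nomic embed?")

# Rank documents by cosine similarity to the query.
print(util.cos_sim(query_emb, doc_emb))
```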
Training Methodology
Training comprises three stages: masked language modeling (MLM) pretraining, unsupervised contrastive pretraining, and supervised contrastive fine-tuning. In the MLM stage, BooksCorpus and Wikipedia are used to train a long-context variant of BERT. Unsupervised contrastive pretraining then draws on a corpus curated down to roughly 235 million text pairs, which teaches the model broad semantic representations. Finally, supervised contrastive fine-tuning on labeled datasets such as MSMarco and NQ sharpens the embeddings for downstream retrieval and ranking tasks.
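Both contrastive stages optimize a paired-text objective; a standard formulation for this kind of training is the InfoNCE loss with in-batch negatives. The sketch below is a simplified, self-contained PyTorch version for illustration only, not the paper's training code; the temperature value, batch size, and random placeholder embeddings are assumptions.

```python
# Illustrative InfoNCE contrastive loss with in-batch negatives (PyTorch).
# A simplified sketch of the contrastive objective, not the paper's code.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim) embeddings of paired texts.
    Each query's positive is the document at the same batch index;
    every other document in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example with random tensors standing in for encoder outputs.
q = torch.randn(8, 768)
d = torch.randn(8, 768)
loss = info_nce_loss(q, d)
print(loss.item())
```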
Model Architecture
The architecture incorporates several modifications to support long contexts. Rotary positional embeddings replace absolute positional embeddings, allowing position information to generalize to longer sequences. Flash Attention makes the attention computation efficient enough to process long inputs. The SwiGLU activation function is used in the feed-forward layers, and dynamic NTK interpolation of the rotary embeddings lets the model scale to 8192-token sequences at inference time. Together these changes yield a compact 137M-parameter model that processes extended contexts effectively, a notable advance in embedding model design.
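To make the rotary-embedding and NTK-interpolation ideas concrete, the sketch below applies rotary position embeddings to a tensor of per-position vectors and rescales the rotary base when the inference length exceeds the training length. It is an illustrative simplification, assuming the commonly used dynamic NTK scaling rule and a placeholder training length of 2048; the model's actual implementation may differ.

```python
# Simplified rotary position embedding (RoPE) with dynamic NTK-style
# rescaling of the base frequency. Illustrative only; the scaling rule
# and the train_len value are assumptions, not the paper's exact code.
import torch

def rope_frequencies(head_dim: int, seq_len: int, train_len: int = 2048,
                     base: float = 10000.0) -> torch.Tensor:
    # When the sequence exceeds the training length, enlarge the rotary base
    # so positions are interpolated rather than extrapolated.
    if seq_len > train_len:
        scale = seq_len / train_len
        base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)      # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate adjacent channel pairs of x (seq_len, head_dim) by position-dependent angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = freqs.cos(), freqs.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: apply RoPE to random "query" vectors at an 8192-token inference length.
freqs = rope_frequencies(head_dim=64, seq_len=8192, train_len=2048)
q = torch.randn(8192, 64)
q_rot = apply_rope(q, freqs)
```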
Experimental Evaluation
nomic-embed-text-v1 performs strongly across several benchmarks, including MTEB, LoCo, and Jina's long-context evaluation. It outperforms text-embedding-ada-002 on both short- and long-context evaluations, with gains spanning information retrieval, clustering, and reranking tasks. Its results on the long-context benchmarks in particular underline its suitability for applications that require understanding of long documents.
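For readers who want to reproduce the short-context numbers, a typical route is the mteb package with a sentence-transformers wrapper, as sketched below. The task name, output folder, and model id are illustrative assumptions based on the public libraries, not the paper's evaluation harness, and the model card's task prefixes may need to be applied for faithful scores.

```python
# Sketch of running a single MTEB task against the released model.
# Based on the public mteb / sentence-transformers APIs; the task name and
# output folder are illustrative, and per-task prefixes are omitted here.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
evaluation = MTEB(tasks=["Banking77Classification"])   # any MTEB task name works
results = evaluation.run(model, output_folder="results/nomic-embed-text-v1")
print(results)
```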
Future Directions and Implications
The open-source, reproducible release of nomic-embed-text-v1 marks a shift toward greater transparency and reliability in NLP. It provides a foundation for future work on embedding models and simplifies auditing and compliance, which is particularly valuable for high-stakes industry deployments. Future research may scale such models further while improving computational efficiency, and may explore additional applications that benefit from extended context processing.
Conclusion
Nomic-embed-text-v1 is a significant contribution to the field of text embeddings. By addressing both the context-length limitations and the heavy computational requirements of prior models, it sets a new benchmark for open, reproducible embedding models and broadens accessibility across machine learning applications. The combined release of weights, code, and datasets enables researchers and practitioners to replicate and extend the work.