
Nomic Embed: Training a Reproducible Long Context Text Embedder (2402.01613v2)

Published 2 Feb 2024 in cs.CL and cs.AI

Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

Citations (58)

Summary

  • The paper introduces a 137M parameter model that efficiently processes up to 8192 tokens, significantly extending context lengths beyond traditional models.
  • It employs a contrastive loss training approach with a pre-trained transformer and ensures full reproducibility by releasing weights, code, and 235 million text pairs.
  • Benchmark scores of 62.39 on MTEB and 85.53 on LoCo show it outperforming existing OpenAI and open-source embedding models, with practical utility for semantic search and data visualization.

Introduction

The technical report introduces "nomic-embed-text-v1," a 137M parameter, long-context text embedding model and a notable advancement in NLP. Unlike preceding models, which are generally closed-source and tend to perform poorly at extended context lengths, it handles both short- and long-context tasks while outperforming existing OpenAI models such as text-embedding-ada-002 and text-embedding-3-small.

Model and Training Approach

The efficacy of text embedding models is gauged by performance on tasks that require understanding document-level content rather than isolated sentences or chunks. Nomic-embed-text-v1 handles up to 8192 tokens, a significant leap from the 512-token limit typical of existing open-source models. The training methodology starts from a pre-trained transformer and optimizes a contrastive loss objective. The report also provides full end-to-end reproducibility, openly sharing not just the model weights and code but also the curated training data: a loader covering 235 million text pairs.
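The report's summary here does not spell out the objective, but a minimal sketch of an in-batch contrastive (InfoNCE-style) loss of the kind described, assuming paired query/document embeddings, in-batch negatives, and an illustrative temperature value (the function name and hyperparameter are not from the paper), could look like this:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over paired query/document embeddings.

    Row i of query_emb and doc_emb form a positive pair; every other row
    in the batch acts as a negative for that pair.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```

In this setup, scaling the batch size increases the number of negatives per example, which is one reason contrastive embedding training is typically run with large batches.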

Benchmarking Results

In quantitative terms, the model's performance is remarkable. On the MTEB benchmark, it records a score of 62.39, and even more impressively, achieves an 85.53 score on the LoCo benchmark, both of which highlight its superiority over its counterparts. Such strong numerical results not only demonstrate the model's capabilities but also promise considerable practical utility in applications such as semantic search and data visualization.
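Such results translate directly into retrieval workflows. A minimal semantic-search sketch follows, assuming the released weights are available as nomic-ai/nomic-embed-text-v1 on the Hugging Face Hub and that queries and documents use the task prefixes from the model card; both the model ID and the exact prefix strings are assumptions beyond this summary.

```python
# Hypothetical usage sketch: ranking documents against a query with the
# released embedder via sentence-transformers. Prefixes are assumed from
# the model card's convention, not from the technical report itself.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

docs = [
    "search_document: Nomic Embed supports sequences of up to 8192 tokens.",
    "search_document: Training uses a contrastive objective with in-batch negatives.",
]
query = "search_query: how long a context does the embedder support?"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_emb, doc_emb)
print(scores)
```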

Conclusion

The release of nomic-embed-text-v1 under an Apache 2.0 license heralds a new era of transparency and accessibility for long-context text embedding models. The authors have made a significant contribution by providing benchmarks that objectively assess model performance across a variety of tasks and context lengths. A deeper look into the specifics of their training data and approach could yield insights into building even more efficient and effective models in the future. For a community increasingly focused on model auditability and compliance, such a fully open release is both timely and crucial.
