- The paper presents nomic-embed-text-v1, a reproducible text embedding model with an 8192-token context length that outperforms OpenAI's text-embedding-ada-002 and text-embedding-3-small on short- and long-context benchmarks.
- Training proceeds in stages: masked language modeling (MLM) on BooksCorpus and Wikipedia, unsupervised contrastive pretraining on a large curated set of text pairs, and supervised contrastive fine-tuning on datasets such as MSMarco.
- Architectural innovations such as rotary embeddings, Flash Attention, and SwiGLU enable a compact 137M parameter design to achieve state-of-the-art performance.
Nomic Embed: Training a Reproducible Long Context Text Embedder
Introduction
The paper presents nomic-embed-text-v1, a text embedding model with an extended sequence length of 8192 tokens that surpasses OpenAI's text-embedding-ada-002 and text-embedding-3-small on both short- and long-context tasks. Its distinguishing feature is a fully open, reproducible release: model weights, training data, and training code are all published, making the pipeline auditable end to end. This makes the model well suited to a broad range of NLP applications, particularly those where long-context processing is critical.
Before this work, most state-of-the-art text embedding models were limited to a maximum context length of 512 tokens, and long-context capability was largely confined to closed-source models. Long-context alternatives such as E5-Mistral-7b-instruct exist, but their size (7B parameters) imposes substantial computational costs. By contrast, nomic-embed-text-v1 raises the context length substantially while remaining computationally practical thanks to its compact, optimized architecture, allowing deployment across a wider range of applications without the burdens of large-parameter models.
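To make the open-release claim concrete, the following sketch loads the published checkpoint through the sentence-transformers library and embeds a query against two documents. The model id, the trust_remote_code flag, and the "search_query: " / "search_document: " task prefixes come from the public model card rather than the paper itself, so treat them as assumptions about the released artifact.

```python
# Minimal usage sketch (not from the paper): loading the openly released
# checkpoint via sentence-transformers. The model id, trust_remote_code flag,
# and task prefixes follow the public model card and are assumptions here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# The model card prescribes task prefixes so the model knows the use case.
docs = [
    "search_document: Nomic Embed is a long-context text embedding model.",
    "search_document: Flash Attention reduces the memory cost of attention.",
]
doc_emb = model.encode(docs)
query_emb = model.encode("search_query: what is nomic embed?")

# Rank documents by cosine similarity to the query.
print(util.cos_sim(query_emb, doc_emb))
```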
Training Methodology
Training comprises three stages: masked language modeling (MLM) pretraining, unsupervised contrastive pretraining, and supervised contrastive fine-tuning. In the MLM stage, BooksCorpus and Wikipedia are used to train a long-context variant of BERT. Unsupervised contrastive pretraining then draws on a corpus curated down to roughly 235 million text pairs, which teaches the model broad semantic representations. Finally, supervised contrastive fine-tuning on labeled datasets such as MSMarco and NQ sharpens the embeddings for downstream retrieval and ranking tasks.
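Both contrastive stages optimize a paired-text objective; a standard formulation for this kind of training is the InfoNCE loss with in-batch negatives. The sketch below is a simplified, self-contained PyTorch version for illustration only, not the paper's training code; the temperature value, batch size, and random placeholder embeddings are assumptions.

```python
# Illustrative InfoNCE contrastive loss with in-batch negatives (PyTorch).
# A simplified sketch of the contrastive objective, not the paper's code.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb, doc_emb: (batch, dim) embeddings of paired texts.
    Each query's positive is the document at the same batch index;
    every other document in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature               # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example with random tensors standing in for encoder outputs.
q = torch.randn(8, 768)
d = torch.randn(8, 768)
loss = info_nce_loss(q, d)
print(loss.item())
```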
Model Architecture
The architecture incorporates several modifications to support long contexts. Rotary positional embeddings replace absolute positional embeddings, allowing position information to generalize to longer sequences. Flash Attention makes the attention computation efficient enough to process long inputs. The SwiGLU activation function is used in the feed-forward layers, and dynamic NTK interpolation of the rotary embeddings lets the model scale to 8192-token sequences at inference time. Together these changes yield a compact 137M-parameter model that processes extended contexts effectively, a notable advance in embedding model design.
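To make the rotary-embedding and NTK-interpolation ideas concrete, the sketch below applies rotary position embeddings to a tensor of per-position vectors and rescales the rotary base when the inference length exceeds the training length. It is an illustrative simplification, assuming the commonly used dynamic NTK scaling rule and a placeholder training length of 2048; the model's actual implementation may differ.

```python
# Simplified rotary position embedding (RoPE) with dynamic NTK-style
# rescaling of the base frequency. Illustrative only; the scaling rule
# and the train_len value are assumptions, not the paper's exact code.
import torch

def rope_frequencies(head_dim: int, seq_len: int, train_len: int = 2048,
                     base: float = 10000.0) -> torch.Tensor:
    # When the sequence exceeds the training length, enlarge the rotary base
    # so positions are interpolated rather than extrapolated.
    if seq_len > train_len:
        scale = seq_len / train_len
        base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)      # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate adjacent channel pairs of x (seq_len, head_dim) by position-dependent angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = freqs.cos(), freqs.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: apply RoPE to random "query" vectors at an 8192-token inference length.
freqs = rope_frequencies(head_dim=64, seq_len=8192, train_len=2048)
q = torch.randn(8192, 64)
q_rot = apply_rope(q, freqs)
```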
Experimental Evaluation
nomic-embed-text-v1 performs strongly across several benchmarks, including MTEB, LoCo, and Jina's long-context evaluation. It outperforms text-embedding-ada-002 on both short- and long-context evaluations, with gains spanning information retrieval, clustering, and reranking tasks. Its results on the long-context benchmarks in particular underline its suitability for applications that require understanding of long documents.
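For readers who want to reproduce the short-context numbers, a typical route is the mteb package with a sentence-transformers wrapper, as sketched below. The task name, output folder, and model id are illustrative assumptions based on the public libraries, not the paper's evaluation harness, and the model card's task prefixes may need to be applied for faithful scores.

```python
# Sketch of running a single MTEB task against the released model.
# Based on the public mteb / sentence-transformers APIs; the task name and
# output folder are illustrative, and per-task prefixes are omitted here.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
evaluation = MTEB(tasks=["Banking77Classification"])   # any MTEB task name works
results = evaluation.run(model, output_folder="results/nomic-embed-text-v1")
print(results)
```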
Future Directions and Implications
The open-source, reproducible release of nomic-embed-text-v1 marks a shift toward greater transparency and reliability in NLP. It provides a foundation for future work on embedding models and simplifies auditing and compliance, which is particularly valuable for high-stakes industry deployments. Future research may scale such models further while improving computational efficiency, and may explore additional applications that benefit from extended context processing.
Conclusion
Nomic-embed-text-v1 is a significant contribution to the field of text embeddings. By addressing both the context-length limitations and the heavy computational requirements of prior models, it sets a new benchmark for open, reproducible embedding models and broadens accessibility across machine learning applications. The combined release of weights, code, and datasets enables researchers and practitioners to replicate and extend the work.