Nomic-Embed-Text-V1 Overview

Updated 30 September 2025
  • Nomic-Embed-Text-V1 is a long-context embedding model that employs a BERT-derived architecture with rotary positional embeddings and SwiGLU activations to encode up to 8192 tokens.
  • It combines unsupervised contrastive pretraining with supervised contrastive fine-tuning, improving performance on retrieval, clustering, and retrieval-augmented generation tasks.
  • Benchmark results show it outperforms both commercial and open models on short and long context tasks, with a fully reproducible open-source release.

Nomic-Embed-Text-V1 is a long-context, open-source English text embedding model built to provide scalable, efficient, and reproducible dense vector representations for natural language applications. Engineered on a BERT-derived base, it incorporates architectural and training modifications to support robust semantic encoding of up to 8192 tokens. The resulting embeddings facilitate advanced retrieval, clustering, and retrieval-augmented generation tasks across both short and long documents, outperforming contemporary commercial and open models on standard benchmarks.

1. Architectural Characteristics and Innovations

Nomic-Embed-Text-V1 employs a BERT base encoder tuned for long-range dependencies within a constrained parameter budget of 137 million. Absolute positional embeddings are replaced by rotary positional embeddings, enabling generalization to contexts longer than the BERT-standard 512 tokens. SwiGLU activations paired with Flash Attention provide roughly 25% faster runtime than GeLU, benefiting both training and inference. Dropout is omitted entirely, and the vocabulary size is padded to a multiple of 64 to keep tensor operations efficient. At inference time, Dynamic NTK interpolation lets the model extend from its 2048-token training context to an 8192-token upper bound without retraining (Nussbaum et al., 2 Feb 2024).

| Component | Description | Innovation |
| --- | --- | --- |
| Positional encoding | Rotary embeddings | Generalizes to 8192 tokens |
| Activation | SwiGLU | ~25% faster than GeLU |
| Attention | Flash Attention | Memory/performance optimized |
| Scaling | Dynamic NTK interpolation at inference | Trains at 2048, runs at 8192 |

This configuration supports both high-throughput and large-context embedding while remaining computationally efficient for its size.
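
As an illustration of the context-extension mechanism, the following is a minimal sketch of NTK-style rescaling applied to standard rotary positional embeddings. Function names, the scaling rule, and tensor shapes here are illustrative assumptions, not the exact implementation released with the model.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Per-pair rotation frequencies used by rotary positional embeddings."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def ntk_scaled_frequencies(head_dim: int, train_ctx: int = 2048,
                           target_ctx: int = 8192, base: float = 10000.0) -> torch.Tensor:
    """NTK-style interpolation: enlarge the rotary base so positions up to
    target_ctx map back into the frequency range seen during 2048-token training."""
    scale = target_ctx / train_ctx  # e.g. 4x context extension
    scaled_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_frequencies(head_dim, scaled_base)

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors of shape (seq_len, head_dim) by their position."""
    seq_len, _ = x.shape
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # (seq_len, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Swapping rope_frequencies(...) for ntk_scaled_frequencies(...) at inference
# is what allows 8192-token positions with weights trained only on 2048 tokens.
```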

2. Training Regimen and Contrastive Objectives

The model's training sequence is staged as follows:

  1. Masked Language Modeling (MLM): Uses BooksCorpus and a 2023 Wikipedia dump, packed into contiguous 2048-token blocks. The model is pretrained with a high masking ratio of 30% and omits next-sentence prediction, increasing task difficulty and encouraging deeper feature learning.
  2. Unsupervised Contrastive Pretraining: Approximately 470 million text pairs from broad web and domain sources are consistency-filtered by cosine similarity (computed via an auxiliary gte-base model), yielding ~235 million training pairs. InfoNCE loss governs representation learning:

$$\mathcal{L}_C = -\frac{1}{n}\sum_i \log \frac{e^{s(q_i, d_i)/\tau}}{e^{s(q_i, d_i)/\tau} + \sum_{j \neq i} e^{s(q_i, d_j)/\tau}}$$

where s(q, d) is the cosine similarity between query and document embeddings and τ is the temperature parameter. Batches of 16,384 enable large in-batch negative sampling; GradCache and mixed precision mitigate memory overhead. A minimal implementation sketch of this loss appears after the list.

  3. Supervised Contrastive Fine-Tuning: Leveraging human-labeled data spanning retrieval (MS MARCO, NQ), entailment (NLI), and specialized QA, inputs are prefixed with explicit task tags (e.g., "search_query", "classification") to allow for robust multitask performance.

This blend of unsupervised and task-driven supervision is designed to maximize both generalization and domain adaptation.
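
As referenced above, a minimal PyTorch sketch of the in-batch-negative InfoNCE objective follows. The batch size, temperature, and embedding dimension are placeholders; the released training code additionally uses GradCache and mixed precision on top of this loss.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative InfoNCE: row i of query_emb is paired with row i of
    doc_emb, and every other row in the batch serves as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # s(q_i, d_j) / tau
    targets = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for encoder outputs.
queries, docs = torch.randn(16, 768), torch.randn(16, 768)
print(info_nce_loss(queries, docs))
```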

3. Benchmark Performance and Empirical Results

Performance is rigorously evaluated on both aggregated and granular retrieval/similarity tasks:

  • Short Context (MTEB): Outperforms OpenAI Ada-002 (60.99) and text-embedding-3-small (62.26) with an average score of 62.39.
  • Long Context: On the Jina Long Context and LoCo benchmarks, Nomic-Embed-Text-V1 achieves an aggregate score of 85.53 on Jina Long Context at 8192 tokens, outperforming jina-embeddings-v2-base-en and OpenAI's models despite its smaller parameter scale.
| Benchmark | OpenAI Ada-002 | OpenAI text-embedding-3-small | Nomic-Embed-Text-V1 |
| --- | --- | --- | --- |
| MTEB (avg) | 60.99 | 62.26 | 62.39 |
| Jina Long Context | < 85.53 | < 85.53 | 85.53 |

This demonstrates competitive or superior retrieval and semantic clustering capability compared to commercial closed models within the same or larger size class.
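
As an illustration, MTEB numbers of this kind can be reproduced with the open mteb benchmark harness and the published checkpoint. The task subset, output path, and installation line below are assumptions for a quick sanity check rather than the full benchmark run.

```python
# pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# trust_remote_code loads the custom long-context BERT implementation
# that ships with the checkpoint on the Hugging Face Hub.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Evaluate a small subset of MTEB tasks; the full benchmark spans dozens of
# retrieval, clustering, classification, and STS datasets.
evaluation = MTEB(tasks=["Banking77Classification", "SciFact"])
evaluation.run(model, output_folder="results/nomic-embed-text-v1")
```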

4. Reproducibility and Open-Source Release

All aspects of the Nomic-Embed-Text-V1 pipeline are openly available under an Apache 2.0 license (Nussbaum et al., 2 Feb 2024):

  • Codebase: Full source for data processing, model training, and evaluation.
  • Model Weights: Final checkpoints for immediate deployment.
  • Curated Data Pipeline: Access to the 235 million pretraining pairs with a custom data loader.

This ensures that results are transparently reproducible and audit-friendly, addressing common concerns about closed proprietary embedding models.

5. Technical and Implementation Challenges

Several challenges were systematically addressed in the model's development:

  • Contextual and Memory Constraints: Replacing BERT's absolute positional embeddings with rotary embeddings, and combining Flash Attention with Dynamic NTK interpolation, lets the architecture scale beyond 2048 tokens efficiently, without retraining or excessive memory usage.
  • Batch Processing and Negative Sampling: Batches of 16,384 samples (required for large in-batch negative pools) place extreme demands on memory and memory bandwidth; this is mitigated with GradCache, mixed precision, and DeepSpeed Stage 2 optimizer strategies (a simplified sketch of the gradient-caching idea follows this list).
  • Task Disambiguation: Prefix tokens attached to each objective keep task types (retrieval, clustering, QA) separated, minimizing cross-task interference in the shared encoder.
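
A simplified sketch of the gradient-caching idea referenced above: one cached forward pass without gradients computes the full-batch loss, and a second per-micro-batch pass backpropagates the cached embedding gradients through the encoder. The micro-batch size, temperature, and encoder interface are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def grad_cache_step(encoder, query_batch, doc_batch, micro_bs=128, tau=0.05):
    """Contrastive step over a very large batch without holding the full
    activation graph in memory (simplified GradCache-style two-pass scheme)."""
    # Pass 1: cache embeddings for the whole batch without building a graph.
    with torch.no_grad():
        q_reps = torch.cat([encoder(query_batch[i:i + micro_bs])
                            for i in range(0, len(query_batch), micro_bs)])
        d_reps = torch.cat([encoder(doc_batch[i:i + micro_bs])
                            for i in range(0, len(doc_batch), micro_bs)])
    q_reps.requires_grad_(True)
    d_reps.requires_grad_(True)

    # Full-batch InfoNCE over the cached embeddings; backward here is cheap
    # because it only produces gradients w.r.t. the embeddings themselves.
    logits = F.normalize(q_reps, dim=-1) @ F.normalize(d_reps, dim=-1).T / tau
    loss = F.cross_entropy(logits, torch.arange(len(q_reps), device=logits.device))
    loss.backward()

    # Pass 2: re-encode each micro-batch with a graph and push the cached
    # embedding gradients through the encoder parameters.
    for cached_grad, batch in ((q_reps.grad, query_batch), (d_reps.grad, doc_batch)):
        for i in range(0, len(batch), micro_bs):
            reps = encoder(batch[i:i + micro_bs])
            reps.backward(gradient=cached_grad[i:i + micro_bs])
    return loss.detach()  # the caller then runs optimizer.step()
```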

6. Downstream Applications

Nomic-Embed-Text-V1 enables a wide range of advanced applications (a brief usage sketch follows the list):

  • Retrieval-Augmented Generation (RAG): Capable of embedding and retrieving over very long input sequences, facilitating large-context augmentation of LLMs.
  • Semantic Search and Document Clustering: Supports analysis and grouping of lengthy documents (news articles, meeting transcripts, policy reports) by semantic similarity.
  • Classification and Multi-Document Analysis: Allows for robust document type assignment and summarization across domains needing high-throughput, context-aware representations.
  • Specialized Domains: Well suited to legal, scientific, and enterprise retrieval contexts where input length and semantic nuance matter.
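
A minimal retrieval-style usage sketch via sentence-transformers, assuming the checkpoint published on the Hugging Face Hub. The task prefixes ("search_query: ", "search_document: ") follow the convention described in Section 2; the example texts are placeholders.

```python
# pip install sentence-transformers einops
from sentence_transformers import SentenceTransformer, util

# trust_remote_code pulls in the custom long-context BERT implementation.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Task prefixes tell the shared encoder which objective each text belongs to.
docs = [
    "search_document: Rotary positional embeddings rotate query/key vectors by position.",
    "search_document: GradCache splits a large contrastive batch into micro-batches.",
]
query = "search_query: How do rotary embeddings encode position?"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
print(util.cos_sim(query_emb, doc_emb))
```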

7. Limitations and Future Directions

Areas for further improvement and active research include:

  • Scaling to Even Larger Contexts: Architectural changes or next-generation encoder extensions to enable even longer context lengths with minimal efficiency loss.
  • Domain Expansion: Incorporating additional or more heterogeneous corpus data to support non-English and more specialized domain retrieval.
  • Robustness and Safety: Developing audit tools and hybrid approaches that integrate retrieval-based verification with long-context embeddings for enhanced explainability and error mitigation.
  • Benchmarking in High-Stakes Scenarios: Extending empirical validation to legal, medical, and other safety-critical applications to rigorously assess model robustness and utility under practical constraints.
  • Handling Numeric and Fine-Grained Details: Like other LLM-based embedding models, it currently exhibits limitations in encoding fine-grained numerical information (Deng et al., 6 Sep 2025); approaches such as specialized numeral tokenization and numeracy-focused pretraining objectives have been proposed as remedies.

Conclusion

Nomic-Embed-Text-V1 is a BERT-derived, 137-million-parameter English long-context text embedding model that sets a benchmark for open, reproducible, and high-throughput semantic representations. Its architectural adaptations, staged contrastive training, and fully open release position it as an efficient alternative to closed-source commercial models, delivering strong performance on both standard and long-context language benchmarks. The model's design and implementation directly address key challenges in scalable retrieval, multitask learning, and deployment efficiency, forming a foundation for ongoing research in large-context semantic embedding.
