Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents (2310.19923v4)

Published 30 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.

References (36)
  1. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
  2. Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
  3. Jina embeddings: A novel set of high-performance sentence embedding models. arXiv preprint arXiv:2307.11224, 2023.
  4. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
  5. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
  6. Latent Dirichlet allocation. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. URL https://proceedings.neurips.cc/paper_files/paper/2001/file/296472c9542ad4d4788d543508116cbc-Paper.pdf.
  7. SimCSE: Simple contrastive learning of sentence embeddings, 2022.
  8. Condenser: A pre-training architecture for dense retrieval, 2021.
  9. RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder, 2022.
  10. Towards general text embeddings with multi-stage contrastive learning, 2023.
  11. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, 2021.
  12. MTEB: Massive text embedding benchmark, 2023.
  13. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
  14. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
  15. Language modeling with gated convolutional networks. CoRR, abs/1612.08083, 2016. URL http://arxiv.org/abs/1612.08083.
  16. Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  17. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
  18. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.org/abs/1909.08053.
  19. Transformers without tears: Improving the normalization of self-attention. CoRR, abs/1910.05895, 2019. URL http://arxiv.org/abs/1910.05895.
  20. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  21. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
  22. Mixed precision training. In International Conference on Learning Representations, 2018.
  23. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  24. Promptagator: Few-shot dense retrieval from 8 examples. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=gmL46YMpu2J.
  25. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.
  26. Understanding the behaviour of contrastive loss. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2495–2504. IEEE, 2021.
  27. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
  28. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
  29. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, 2015.
  30. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
  31. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
  32. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018. URL http://arxiv.org/abs/1804.07461.
  33. Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. CoRR, abs/1811.01088, 2018. URL http://arxiv.org/abs/1811.01088.
  34. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. CoRR, abs/1906.03741, 2019. URL http://arxiv.org/abs/1906.03741.
  35. Wikimedia Foundation. Wikimedia downloads, 2022. URL https://dumps.wikimedia.org.
  36. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, 2020.

Summary

  • The paper presents a modified BERT with bidirectional ALiBi to extend the input length to 8192 tokens.
  • It employs a three-stage training process, including hard negatives, to optimize contrastive learning for diverse tasks.
  • Evaluations on benchmarks such as GLUE and MTEB, plus newly introduced long-document tasks, confirm competitive performance and show gains from the extended context.

This paper introduces Jina Embeddings 2 (Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents, 2023), a new suite of open-source text embedding models designed to overcome the typical 512-token input length limitation of most existing models, particularly those based on BERT. This limitation necessitates splitting long documents for embedding, leading to increased vector storage, higher memory usage, and slower search times, while potentially losing the overall semantic context of the document.

The core innovation of Jina Embeddings 2 is a modified BERT architecture that incorporates Attention with Linear Biases (ALiBi) (Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021) instead of conventional positional embeddings. Unlike the causal ALiBi used in generative models, Jina employs a bidirectional ALiBi variant suitable for encoder models. This allows the model to effectively process sequence lengths much longer than those used during pre-training. The architecture also includes Gated Linear Units (GLU) in feedforward layers and uses post-layer normalization.
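
As a rough sketch of the idea, the snippet below constructs such a bidirectional ALiBi bias in PyTorch: the penalty grows linearly with the absolute distance between positions rather than only looking backward as in the causal form. The slope handling is simplified, and anything beyond what the paper states (exact slope scheme, integration point) should be read as an assumption.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper
    # (for 8 heads: 1/2, 1/4, ..., 1/256); the original paper uses a
    # slightly different scheme when the head count is not a power of two.
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def bidirectional_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Causal ALiBi only penalizes positions j <= i; the bidirectional
    # variant for encoders penalizes the absolute distance |i - j|.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # (L, L)
    slopes = alibi_slopes(num_heads)                             # (H,)
    return -slopes[:, None, None] * distance[None, :, :]         # (H, L, L)

# The bias is added to the attention logits before the softmax, e.g.
# scores = q @ k.transpose(-2, -1) / head_dim**0.5 + bias
```

Because the bias depends only on relative distance, the same matrix construction works for any sequence length at inference time, which is what allows extrapolation beyond the training length.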

The training of Jina Embeddings 2 involves a three-stage process:

  1. Pre-training a Modified BERT: A modified BERT model is trained from scratch on a large English corpus (the C4 dataset) using a masked language modeling (MLM) objective. Although pre-training sequences are capped at 512 tokens, ALiBi enables the model to extrapolate to much longer sequences at inference time.
  2. Fine-tuning with Text Pairs: The pre-trained model is fine-tuned on diverse text pairs using the InfoNCE loss function (a simplified sketch of the loss follows this list). This stage pulls the embeddings of related texts together in the vector space while pushing unrelated texts apart.
  3. Fine-tuning with Hard Negatives: In the final stage, the model is further fine-tuned using text pairs augmented with hard negative examples. This stage is crucial for improving performance on tasks like retrieval by teaching the model to discriminate between relevant passages and similar-but-irrelevant ones, using a modified InfoNCE loss. Memory optimizations such as mixed precision, DeepSpeed, and activation checkpointing are employed to facilitate training with large batch sizes, which is important for contrastive learning objectives.
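
To make the contrastive objective concrete, here is a minimal PyTorch sketch of an in-batch InfoNCE loss and a hard-negative variant. The temperature value and the exact arrangement of negatives are illustrative assumptions; the paper's formulation (for instance, whether the pair loss is applied in both query/passage directions) differs in details.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, temperature=0.05):
    # In-batch InfoNCE: row i's positive sits at column i; every other
    # passage in the batch acts as a negative for that query.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (B, B)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def info_nce_hard_negatives(query_emb, pos_emb, neg_emb, temperature=0.05):
    # Stage-three style variant: mined hard negatives are appended as extra
    # columns, so the model must rank the true passage above
    # similar-but-irrelevant ones.
    q = F.normalize(query_emb, dim=-1)
    candidates = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)
    logits = q @ candidates.T / temperature             # (B, B + num_negatives)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Large batch sizes matter here because the number of in-batch negatives, and therefore the difficulty of the contrastive task, grows with the batch; this is why the memory optimizations listed in stage three are relevant.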

The evaluation demonstrates the effectiveness of Jina Embeddings 2. The pre-trained backbone models achieve competitive performance on the GLUE benchmark compared to standard BERT and RoBERTa models. Crucially, experiments show that the masked language modeling (MLM) accuracy of the Jina models remains stable for sequences of up to 8192 tokens at inference time, confirming ALiBi's extrapolation capability even though training sequences were limited to 512 tokens.

The embedding models (jina-embeddings-v2-small-en, 33M parameters, and jina-embeddings-v2-base-en, 137M parameters, both available on Hugging Face) were evaluated on the Massive Text Embedding Benchmark (MTEB) (MTEB: Massive Text Embedding Benchmark, 2022). They show competitive performance across classification, clustering, pair classification, reranking, retrieval, and semantic similarity tasks, often matching or surpassing strong baselines such as E5 and OpenAI's text-embedding-ada-002.
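
For context, a minimal usage sketch, assuming the interface documented on the Hugging Face model cards (a custom encode helper exposed via trust_remote_code), might look like this:

```python
from transformers import AutoModel

# trust_remote_code pulls in the custom ALiBi-based BERT implementation
# published alongside the weights.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

docs = [
    "A short query about patent clustering.",
    "A very long document that a 512-token model would have to truncate...",
]
# encode() is the convenience helper exposed by the model card; max_length
# can be raised up to 8192 to use the full context window.
embeddings = model.encode(docs, max_length=8192)
print(embeddings.shape)  # (2, 768) for the base model
```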

To specifically assess performance on long documents, the authors introduced new clustering tasks (PatentClustering, WikiCitiesClustering) and a retrieval task (NarrativeQA) with lengthy texts, in addition to evaluating on SciFact. Results indicate that utilizing the extended context length (up to 8192 tokens) generally leads to improved performance on these long-document tasks, particularly for NarrativeQA and PatentClustering. The paper notes that in some cases, such as WikiCitiesClustering, longer contexts can slightly decrease performance, potentially because less relevant details in the document tail dilute the key distinguishing information. Evaluation on the LoCo benchmark further validates the models' strong performance on long-document retrieval tasks, showing competitive average nDCG@10 scores.
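
For reference, the nDCG@10 metric mentioned above can be computed as in this simplified sketch; it derives the ideal ranking from the supplied relevance list only, which is sufficient when all relevant documents appear in the retrieved list.

```python
import math

def ndcg_at_10(relevances):
    # relevances: graded relevance of the retrieved documents, in rank order.
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: the single relevant document is retrieved at rank 3.
print(round(ndcg_at_10([0, 0, 1, 0, 0]), 2))  # 0.5
```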

In summary, Jina Embeddings 2 provides a practical solution for generating high-quality text embeddings for documents up to 8192 tokens, a 16x increase in context length compared to many established models. By leveraging a modified BERT architecture with bidirectional ALiBi and a multi-stage training process, the models achieve state-of-the-art or competitive performance on standard benchmarks while offering significant advantages for applications involving long text data, such as enhanced information retrieval and clustering of long documents.
