Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents (2310.19923v4)

Published 30 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.

References (36)
  1. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
  2. Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
  3. Jina embeddings: A novel set of high-performance sentence embedding models. arXiv preprint arXiv:2307.11224, 2023.
  4. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
  5. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
  6. Latent Dirichlet allocation. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14. MIT Press, 2001. URL https://proceedings.neurips.cc/paper_files/paper/2001/file/296472c9542ad4d4788d543508116cbc-Paper.pdf.
  7. SimCSE: Simple contrastive learning of sentence embeddings, 2022.
  8. Condenser: A pre-training architecture for dense retrieval, 2021.
  9. RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder, 2022.
  10. Towards general text embeddings with multi-stage contrastive learning, 2023.
  11. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, 2021.
  12. MTEB: Massive text embedding benchmark, 2023.
  13. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
  14. RoBERTa: A robustly optimized BERT pretraining approach, 2019.
  15. Language modeling with gated convolutional networks. CoRR, abs/1612.08083, 2016. URL http://arxiv.org/abs/1612.08083.
  16. Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  17. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
  18. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.org/abs/1909.08053.
  19. Transformers without tears: Improving the normalization of self-attention. CoRR, abs/1910.05895, 2019. URL http://arxiv.org/abs/1910.05895.
  20. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  21. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
  22. Mixed precision training. In International Conference on Learning Representations, 2018.
  23. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  24. Promptagator: Few-shot dense retrieval from 8 examples. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=gmL46YMpu2J.
  25. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.
  26. Understanding the behaviour of contrastive loss. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2495–2504. IEEE, 2021.
  27. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
  28. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 2019.
  29. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, 2015.
  30. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
  31. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
  32. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018. URL http://arxiv.org/abs/1804.07461.
  33. Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. CoRR, abs/1811.01088, 2018. URL http://arxiv.org/abs/1811.01088.
  34. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. CoRR, abs/1906.03741, 2019. URL http://arxiv.org/abs/1906.03741.
  35. Wikimedia Foundation. Wikimedia downloads, 2022. URL https://dumps.wikimedia.org.
  36. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, 2020.

Summary

  • The paper presents a modified BERT with bidirectional ALiBi to extend the input length to 8192 tokens.
  • It employs a three-stage training process, including hard negatives, to optimize contrastive learning for diverse tasks.
  • Evaluations on benchmarks such as GLUE and MTEB, plus newly introduced long-document tasks, confirm competitive performance and show gains from the extended context.

This paper introduces Jina Embeddings 2 (Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents, 2023), a new suite of open-source text embedding models designed to overcome the typical 512-token input length limitation of most existing models, particularly those based on BERT. This limitation necessitates splitting long documents for embedding, leading to increased vector storage, higher memory usage, and slower search times, while potentially losing the overall semantic context of the document.

The core innovation of Jina Embeddings 2 is a modified BERT architecture that incorporates Attention with Linear Biases (ALiBi) (Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation, 2021) instead of conventional positional embeddings. Unlike the causal ALiBi used in generative models, Jina employs a bidirectional ALiBi variant suitable for encoder models. This allows the model to effectively process sequence lengths much longer than those used during pre-training. The architecture also includes Gated Linear Units (GLU) in feedforward layers and uses post-layer normalization.
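
As a rough sketch of the idea, the snippet below constructs such a bidirectional ALiBi bias in PyTorch: the penalty grows linearly with the absolute distance between positions rather than only looking backward as in the causal form. The slope handling is simplified, and anything beyond what the paper states (exact slope scheme, integration point) should be read as an assumption.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence of per-head slopes from the ALiBi paper
    # (for 8 heads: 1/2, 1/4, ..., 1/256); the original paper uses a
    # slightly different scheme when the head count is not a power of two.
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def bidirectional_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Causal ALiBi only penalizes positions j <= i; the bidirectional
    # variant for encoders penalizes the absolute distance |i - j|.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # (L, L)
    slopes = alibi_slopes(num_heads)                             # (H,)
    return -slopes[:, None, None] * distance[None, :, :]         # (H, L, L)

# The bias is added to the attention logits before the softmax, e.g.
# scores = q @ k.transpose(-2, -1) / head_dim**0.5 + bias
```

Because the bias depends only on relative distance, the same matrix construction works for any sequence length at inference time, which is what allows extrapolation beyond the training length.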

The training of Jina Embeddings 2 involves a three-stage process:

  1. Pre-training a Modified BERT: A modified BERT model is trained from scratch on a large English corpus (the C4 dataset) using a masked language modeling (MLM) objective. Although pre-training sequences are capped at 512 tokens, ALiBi enables the model to extrapolate to much longer sequences at inference time.
  2. Fine-tuning with Text Pairs: The pre-trained model is fine-tuned on diverse text pairs using the InfoNCE loss function (a simplified sketch of the loss follows this list). This stage pulls the embeddings of related texts together in the vector space while pushing unrelated texts apart.
  3. Fine-tuning with Hard Negatives: In the final stage, the model is further fine-tuned using text pairs augmented with hard negative examples. This stage is crucial for improving performance on tasks like retrieval by teaching the model to discriminate between relevant passages and similar-but-irrelevant ones, using a modified InfoNCE loss. Memory optimizations such as mixed precision, DeepSpeed, and activation checkpointing are employed to facilitate training with large batch sizes, which is important for contrastive learning objectives.
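
To make the contrastive objective concrete, here is a minimal PyTorch sketch of an in-batch InfoNCE loss and a hard-negative variant. The temperature value and the exact arrangement of negatives are illustrative assumptions; the paper's formulation (for instance, whether the pair loss is applied in both query/passage directions) differs in details.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, temperature=0.05):
    # In-batch InfoNCE: row i's positive sits at column i; every other
    # passage in the batch acts as a negative for that query.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature                      # (B, B)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def info_nce_hard_negatives(query_emb, pos_emb, neg_emb, temperature=0.05):
    # Stage-three style variant: mined hard negatives are appended as extra
    # columns, so the model must rank the true passage above
    # similar-but-irrelevant ones.
    q = F.normalize(query_emb, dim=-1)
    candidates = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)
    logits = q @ candidates.T / temperature             # (B, B + num_negatives)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Large batch sizes matter here because the number of in-batch negatives, and therefore the difficulty of the contrastive task, grows with the batch; this is why the memory optimizations listed in stage three are relevant.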

The evaluation demonstrates the effectiveness of Jina Embeddings 2. The pre-trained backbone models achieve competitive performance on the GLUE benchmark compared to standard BERT and RoBERTa models. Crucially, experiments show that the masked language modeling (MLM) accuracy of the Jina models remains stable for sequences of up to 8192 tokens at inference time, confirming ALiBi's extrapolation capability even though training sequences were limited to 512 tokens.

The embedding models (jina-embeddings-v2-small-en, 33M parameters, and jina-embeddings-v2-base-en, 137M parameters, both available on Hugging Face) were evaluated on the Massive Text Embedding Benchmark (MTEB) (MTEB: Massive Text Embedding Benchmark, 2022). They show competitive performance across classification, clustering, pair classification, reranking, retrieval, and semantic similarity tasks, often matching or surpassing strong baselines such as E5 and OpenAI's text-embedding-ada-002.
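
For context, a minimal usage sketch, assuming the interface documented on the Hugging Face model cards (a custom encode helper exposed via trust_remote_code), might look like this:

```python
from transformers import AutoModel

# trust_remote_code pulls in the custom ALiBi-based BERT implementation
# published alongside the weights.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

docs = [
    "A short query about patent clustering.",
    "A very long document that a 512-token model would have to truncate...",
]
# encode() is the convenience helper exposed by the model card; max_length
# can be raised up to 8192 to use the full context window.
embeddings = model.encode(docs, max_length=8192)
print(embeddings.shape)  # (2, 768) for the base model
```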

To specifically assess performance on long documents, the authors introduced new clustering tasks (PatentClustering, WikiCitiesClustering) and a retrieval task (NarrativeQA) with lengthy texts, in addition to evaluating on SciFact. Results indicate that utilizing the extended context length (up to 8192 tokens) generally leads to improved performance on these long-document tasks, particularly for NarrativeQA and PatentClustering. The paper notes that in some cases, such as WikiCitiesClustering, longer contexts can slightly decrease performance, potentially because less relevant details in the document tail dilute the key distinguishing information. Evaluation on the LoCo benchmark further validates the models' strong performance on long-document retrieval tasks, showing competitive average nDCG@10 scores.
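
For reference, the nDCG@10 metric mentioned above can be computed as in this simplified sketch; it derives the ideal ranking from the supplied relevance list only, which is sufficient when all relevant documents appear in the retrieved list.

```python
import math

def ndcg_at_10(relevances):
    # relevances: graded relevance of the retrieved documents, in rank order.
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Example: the single relevant document is retrieved at rank 3.
print(round(ndcg_at_10([0, 0, 1, 0, 0]), 2))  # 0.5
```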

In summary, Jina Embeddings 2 provides a practical solution for generating high-quality text embeddings for documents up to 8192 tokens, a 16x increase in context length compared to many established models. By leveraging a modified BERT architecture with bidirectional ALiBi and a multi-stage training process, the models achieve state-of-the-art or competitive performance on standard benchmarks while offering significant advantages for applications involving long text data, such as enhanced information retrieval and clustering of long documents.
