Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents (2310.19923v4)
Abstract: Text embedding models have emerged as powerful tools for transforming sentences into fixed-sized feature vectors that encapsulate semantic information. While these models are essential for tasks like information retrieval, semantic clustering, and text re-ranking, most existing open-source models, especially those built on architectures like BERT, struggle to represent lengthy documents and often resort to truncation. One common approach to mitigate this challenge involves splitting documents into smaller paragraphs for embedding. However, this strategy results in a much larger set of vectors, consequently leading to increased memory consumption and computationally intensive vector searches with elevated latency. To address these challenges, we introduce Jina Embeddings 2, an open-source text embedding model capable of accommodating up to 8192 tokens. This model is designed to transcend the conventional 512-token limit and adeptly process long documents. Jina Embeddings 2 not only achieves state-of-the-art performance on a range of embedding-related tasks in the MTEB benchmark but also matches the performance of OpenAI's proprietary ada-002 model. Additionally, our experiments indicate that an extended context can enhance performance in tasks such as NarrativeQA.
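To make the chunking overhead described above concrete, the following is a minimal back-of-the-envelope sketch, not taken from the paper: the chunk size of 512 tokens, the 768-dimensional float32 embeddings, and the hypothetical corpus of 10,000 documents of roughly 6,000 tokens each are all illustrative assumptions used only to show how the number of stored vectors, and therefore index memory, shrinks when a document fits into a single 8192-token context.

```python
# Back-of-the-envelope comparison of vector-store size when long documents
# are chunked for a 512-token encoder versus embedded whole by an
# 8192-token encoder. Chunk size, embedding dimension, and corpus
# statistics are illustrative assumptions, not figures from the paper.
import math

EMBED_DIM = 768          # assumed base-model embedding width
BYTES_PER_FLOAT = 4      # float32 storage per dimension


def num_vectors(doc_tokens: int, max_context: int) -> int:
    """Number of embedding vectors needed to cover one document."""
    return math.ceil(doc_tokens / max_context)


def store_bytes(doc_lengths: list[int], max_context: int) -> int:
    """Total bytes to store embeddings for a corpus of documents."""
    total_vectors = sum(num_vectors(n, max_context) for n in doc_lengths)
    return total_vectors * EMBED_DIM * BYTES_PER_FLOAT


# Hypothetical corpus: 10,000 documents of ~6,000 tokens each.
corpus = [6000] * 10_000

short_ctx = store_bytes(corpus, 512)    # conventional 512-token limit
long_ctx = store_bytes(corpus, 8192)    # 8192-token context

print(f"512-token chunking : {short_ctx / 1e6:.1f} MB, "
      f"{sum(num_vectors(n, 512) for n in corpus):,} vectors")
print(f"8192-token context : {long_ctx / 1e6:.1f} MB, "
      f"{sum(num_vectors(n, 8192) for n in corpus):,} vectors")
```

Under these assumed numbers the chunked index holds roughly 12 times as many vectors as the long-context one; since approximate nearest-neighbor search cost and memory both grow with the number of indexed vectors, this is the latency and memory trade-off the abstract refers to.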