LongEmbed: Extending Embedding Models for Long Context Retrieval (2404.12096v3)

Published 18 Apr 2024 in cs.CL and cs.LG

Abstract: Embedding models play a pivotal role in modern NLP applications such as IR and RAG. While the context limit of LLMs has been pushed beyond 1 million tokens, embedding models are still confined to a narrow context window not exceeding 8k tokens, which keeps them from application scenarios requiring long inputs such as legal contracts. This paper explores context window extension of existing embedding models, pushing the limit to 32k without requiring additional training. First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed comprises two synthetic tasks and four carefully chosen real-world tasks, featuring documents of varying length and dispersed target information. Benchmarking results underscore huge room for improvement in these models. Based on this, comprehensive experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models several-fold, regardless of whether their original context is 512 or beyond 4k. Furthermore, for models employing absolute position encoding (APE), we show the possibility of further fine-tuning to harvest notable performance gains while strictly preserving original behavior for short inputs. For models using rotary position embedding (RoPE), significant enhancements are observed when employing RoPE-specific methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for context window extension. To facilitate future research, we release E5-Base-4k and E5-RoPE-Base, along with the LongEmbed benchmark.

Extending Context Window in Embedding Models for Enhanced Long Input Processing

Introduction and Motivation

Embedding models are fundamental to many NLP applications, yet they have traditionally been limited by narrow context windows. This paper explores strategies for extending the context windows of existing embedding models without retraining, focusing on long-input scenarios such as lengthy documents or detailed contracts, where models limited to 512 to 8k tokens fall short.

Benchmarking Current Models

The paper introduces LongEmbed, a new benchmark designed to assess how well embedding models handle extended contexts. LongEmbed includes both synthetic and real-world tasks whose inputs significantly exceed traditional lengths. The benchmarking results highlight considerable room for improvement, as current models struggle to manage longer contexts effectively.
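
To make the flavor of the synthetic tasks concrete, below is a minimal, hedged sketch of a passkey-style long-document retrieval check; the filler text, document lengths, query phrasing, and the embed callable are illustrative assumptions rather than the benchmark's actual construction.

```python
import random
import torch
import torch.nn.functional as F

def make_passkey_doc(passkey: str, n_filler: int = 2000) -> str:
    """Bury a short passkey sentence inside long filler text, so the target
    information sits at an arbitrary position in a long document."""
    filler = ["The grass is green. The sky is blue."] * n_filler
    filler.insert(random.randrange(len(filler) + 1), f"The pass key is {passkey}.")
    return " ".join(filler)

def passkey_retrieval_accuracy(embed, passkeys):
    """embed: placeholder callable mapping a string to a 1D embedding tensor.
    Each query names one passkey; retrieval succeeds if the closest document
    (by cosine similarity) is the one that actually contains that passkey."""
    docs = [make_passkey_doc(k) for k in passkeys]
    doc_emb = torch.stack([embed(d) for d in docs])            # (N, dim)
    hits = 0
    for i, key in enumerate(passkeys):
        q = embed(f"Find the document whose pass key is {key}.")
        sims = F.cosine_similarity(q.unsqueeze(0), doc_emb)    # (N,)
        hits += int(sims.argmax().item() == i)
    return hits / len(passkeys)
```

A model whose effective window is shorter than the documents will tend to miss passkeys buried beyond its context limit, which is the failure mode this style of task is designed to expose.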

Strategies for Context Extension

Several methodologies were tested for extending the operational range of these models:

  • Position Interpolation and Reorganization: Methods like parallel context windows and position interpolation proved effective across various models, multiplying the effective context window several-fold.
  • RoPE and APE Comparisons: Distinct strategies were tailored to each position encoding scheme: Absolute Position Encoding (APE) and Rotary Position Embedding (RoPE). For APE models, techniques like position interpolation allowed extended context processing without additional training. RoPE models benefited most from RoPE-specific methods such as NTK-aware interpolation and SelfExtend, which exploit their handling of relative positions; a hedged sketch of both ideas follows this list.
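
As a rough illustration of these two families of training-free extension (not the paper's actual implementation), the sketch below linearly interpolates a learned APE table to cover more positions and computes NTK-scaled rotary frequencies; the tensor shapes, the 512-token original window, and the RoPE base of 10000 are assumptions.

```python
import torch
import torch.nn.functional as F

def interpolate_ape(pos_emb: torch.Tensor, target_len: int) -> torch.Tensor:
    """Linearly interpolate a learned absolute position embedding table of
    shape (orig_len, dim) to (target_len, dim), squeezing more positions into
    the range the model was trained on."""
    return F.interpolate(
        pos_emb.T.unsqueeze(0),          # (1, dim, orig_len)
        size=target_len, mode="linear", align_corners=True,
    ).squeeze(0).T                       # (target_len, dim)

def ntk_scaled_rope_inv_freq(dim: int, target_len: int, orig_len: int = 512,
                             base: float = 10000.0) -> torch.Tensor:
    """NTK-aware scaling: enlarge the RoPE base so low-frequency components
    stretch to cover the longer window while high-frequency (local)
    components are left nearly intact."""
    scale = target_len / orig_len
    new_base = base * scale ** (dim / (dim - 2))
    return 1.0 / (new_base ** (torch.arange(0, dim, 2).float() / dim))

# Example: stretch a 512-position APE table to 4096 positions, and compute
# NTK-scaled inverse frequencies for a 64-dimensional rotary head.
ape_4k = interpolate_ape(torch.randn(512, 768), 4096)    # (4096, 768)
inv_freq = ntk_scaled_rope_inv_freq(64, 4096)            # (32,)
```

SelfExtend-style grouping of distant positions is omitted here; the key point is that both adjustments operate purely on position handling and leave the model weights untouched.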

Empirical Findings

The empirical studies yielded two main findings:

  • APE-based models handled substantially longer inputs once their position embeddings were extended, and further fine-tuning brought additional gains while strictly preserving performance on short inputs.
  • RoPE-based models saw larger improvements with RoPE-specific extensions such as NTK and SelfExtend, demonstrating their potential for managing even longer inputs effectively; for instance, extending E5-Mistral's context window to 32k tokens yielded notable gains on long-context retrieval.

Implications and Future Work

The insights from this paper have substantial implications for the development of more efficient and capable embedding models. The demonstrated advantage of RoPE in handling extended contexts suggests favoring RoPE over APE in future embedding model designs. Moreover, the methodologies and the new benchmark introduced here provide a foundation for further research into embedding model enhancements.

The research also sets the stage for exploring additional strategies in context window extension and fine-tuning, and stresses the benefit of shared benchmarks like LongEmbed for consistent evaluation and comparison of future models.

Conclusion

Overall, this work not only advances our understanding of how embedding models can be adapted to manage longer contexts effectively but also underlines the importance of model and methodology choices in achieving high performance in long-input scenarios. Researchers are encouraged to leverage the findings and tools made available through this paper to propel the capabilities of NLP applications further.

Authors (7)
  1. Dawei Zhu
  2. Liang Wang
  3. Nan Yang
  4. Yifan Song
  5. Wenhao Wu
  6. Furu Wei
  7. Sujian Li