
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (2404.05961v2)

Published 9 Apr 2024 in cs.CL and cs.AI

Abstract: Large decoder-only LLMs are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 4 popular LLMs ranging from 1.3B to 8B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data (as of May 24, 2024). Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.

LLM2Vec: Transforming Decoder-Only LLMs into Universal Text Encoders

Introduction

In NLP, text embedding models represent text as dense vectors that can be used efficiently in downstream tasks such as semantic similarity, information retrieval, and text classification. Historically, such models have been built on encoder-only or encoder-decoder architectures trained and adapted specifically for embedding text. LLM2Vec shifts this landscape by repurposing large decoder-only LLMs, the current state of the art for most NLP tasks, as text encoders.

LLM2Vec Methodology

LLM2Vec is a simple unsupervised recipe for turning any pre-trained decoder-only LLM into a strong text encoder. The transformation involves three steps: enabling bidirectional attention, applying masked next token prediction (MNTP), and adding unsupervised contrastive learning (SimCSE). Together, these steps remove the restriction imposed by causal attention and let the model produce rich, context-aware representations of whole sequences; a minimal sketch of the three steps follows the list below.

  • Enabling Bidirectional Attention: LLM2Vec first replaces the causal attention mask with a bidirectional one, so that every token can attend to every other token in the sequence.
  • Masked Next Token Prediction (MNTP): To adapt the model to its new bidirectional attention, LLM2Vec trains it with MNTP, which combines masked language modeling with next-token prediction: a fraction of input tokens is masked, and each masked token is predicted from the hidden state at the preceding position. This teaches the model to draw on both past and future context.
  • Unsupervised Contrastive Learning (SimCSE): Finally, SimCSE improves sequence-level embeddings: the same sequence is encoded twice with different dropout masks to form a positive pair, and other sequences in the batch serve as in-batch negatives for a contrastive objective.
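
To make the three steps concrete, here is a minimal PyTorch sketch built around a toy single-layer decoder. It illustrates the ideas only and is not the authors' code: the names (ToyDecoder, mntp_loss, simcse_loss) and all hyperparameters are invented for the example, while real LLM2Vec applies the same objectives to a pre-trained decoder-only LLM.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyDecoder(nn.Module):
        """A stand-in for a decoder-only LLM, small enough to run anywhere."""
        def __init__(self, vocab=1000, d=64, bidirectional=True):
            super().__init__()
            self.bidirectional = bidirectional        # step 1: toggle causal masking
            self.emb = nn.Embedding(vocab, d)
            self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
            self.drop = nn.Dropout(0.1)               # dropout doubles as SimCSE augmentation
            self.lm_head = nn.Linear(d, vocab)

        def hidden(self, ids):
            x = self.drop(self.emb(ids))
            L = ids.size(1)
            mask = None
            if not self.bidirectional:                # causal mask hides future tokens
                mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
            h, _ = self.attn(x, x, x, attn_mask=mask)
            return h                                  # (batch, seq, d)

    def mntp_loss(model, ids, mask_prob=0.2, mask_id=0):
        # Step 2: mask some tokens and predict each masked token from the hidden
        # state at the *previous* position, matching a decoder's next-token head.
        masked = ids.clone()
        is_masked = torch.rand(ids.shape) < mask_prob
        is_masked[:, 0] = False                       # no previous position for token 0
        masked[is_masked] = mask_id
        h = model.hidden(masked)
        logits = model.lm_head(h[:, :-1])             # position i-1 predicts token i
        targets = ids[:, 1:].clone()
        targets[~is_masked[:, 1:]] = -100             # score masked positions only
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=-100)

    def simcse_loss(model, ids, tau=0.05):
        # Step 3: two forward passes with independent dropout give a positive
        # pair; other sequences in the batch act as in-batch negatives.
        z1 = model.hidden(ids).mean(dim=1)            # mean pooling over tokens
        z2 = model.hidden(ids).mean(dim=1)
        sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
        return F.cross_entropy(sim, torch.arange(ids.size(0)))

    model = ToyDecoder(bidirectional=True)            # step 1: bidirectional attention
    ids = torch.randint(1, 1000, (8, 32))             # toy batch of token ids
    print(mntp_loss(model, ids).item(), simcse_loss(model, ids).item())

In the paper, the two objectives are applied sequentially (MNTP first, then unsupervised SimCSE), and adaptation is parameter-efficient (LoRA) rather than full fine-tuning.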

Empirical Validation

The efficacy of LLM2Vec is evaluated across several settings. Applied to popular LLMs including S-LLaMA-1.3B, LLaMA-2-7B, and Mistral-7B, LLM2Vec consistently outperforms encoder-only models on word-level tasks such as chunking, named-entity recognition (NER), and part-of-speech (POS) tagging. On the Massive Text Embedding Benchmark (MTEB), the LLM2Vec-transformed Mistral-7B model achieves state-of-the-art performance among unsupervised models. Combining LLM2Vec with supervised contrastive learning further improves results, establishing a new state of the art on MTEB among models trained only on publicly available data.
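
For MTEB-style evaluation, each piece of text is mapped to a single vector and similarity is computed between vectors. The sketch below shows the standard recipe of mean pooling the final hidden states over non-padding tokens; it uses an off-the-shelf Hugging Face encoder as a placeholder, since loading an actual LLM2Vec checkpoint additionally requires the authors' bidirectional-attention modifications.

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "sentence-transformers/all-MiniLM-L6-v2"   # placeholder, not an LLM2Vec model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()

    def embed(sentences):
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state        # (batch, seq, d)
        mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding positions
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

    a, b = embed(["A cat sits on the mat.", "A kitten rests on a rug."])
    print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())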

Analytical Insights

A deeper analysis of LLM2Vec-transformed models shows that they integrate information from future tokens into each token's representation, a critical property for strong sequence representations. Interestingly, Mistral-7B performs surprisingly well when bidirectional attention is simply enabled, even before any adaptation, suggesting that it may have been pre-trained with some form of bidirectional attention. This behavior of Mistral models raises questions about the techniques used in their pre-training.
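
The causal-versus-bidirectional contrast at the heart of this analysis can be reproduced with a toy attention layer. The snippet below is an illustration under stated assumptions (random weights, a single attention layer), not the paper's probing setup: with a causal mask, perturbing a future token cannot change the representation of an earlier position, whereas bidirectional attention lets the change propagate backwards.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d, L = 32, 6
    emb = nn.Embedding(100, d)
    attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def reps(ids, causal):
        x = emb(ids)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1) if causal else None
        out, _ = attn(x, x, x, attn_mask=mask)
        return out

    ids = torch.randint(0, 100, (1, L))
    ids2 = ids.clone()
    ids2[0, -1] = (ids2[0, -1] + 1) % 100        # change only the last (future) token

    with torch.no_grad():
        for causal in (True, False):
            delta = (reps(ids, causal)[0, 0] - reps(ids2, causal)[0, 0]).norm().item()
            print(f"causal={causal}: change at position 0 = {delta:.4f}")
    # Expected: ~0 with the causal mask; non-zero with bidirectional attention.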

Implications and Future Directions

LLM2Vec demonstrates the untapped potential of decoder-only LLMs for text embedding and offers a computationally efficient way to repurpose them as universal text encoders. Its simplicity and effectiveness make it attractive for resource-constrained settings, broadening access to state-of-the-art embedding capabilities. The apparent bidirectional behavior of Mistral models also invites further investigation into how such models were pre-trained, which could inform future LLM pre-training strategies.

Conclusion

LLM2Vec shows that large decoder-only LLMs, once equipped with bidirectional attention and adapted through masked next token prediction and unsupervised contrastive learning, become strong text encoders. Its performance in both unsupervised and supervised settings suggests that decoder-only models can serve as efficient, general-purpose embedding models for real-world NLP applications.

Authors (6)
  1. Parishad BehnamGhader
  2. Vaibhav Adlakha
  3. Marius Mosbach
  4. Dzmitry Bahdanau
  5. Nicolas Chapados
  6. Siva Reddy