
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (2404.05961v2)

Published 9 Apr 2024 in cs.CL and cs.AI

Abstract: Large decoder-only LLMs are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 4 popular LLMs ranging from 1.3B to 8B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data (as of May 24, 2024). Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.

LLM2Vec: Transforming Decoder-Only LLMs into Universal Text Encoders

Introduction

In NLP, text embedding models represent text as dense vectors that can be used efficiently in downstream tasks such as semantic similarity, information retrieval, and text classification. Historically, such models have been built on encoder-only or encoder-decoder architectures trained and adapted specifically for embedding text. LLM2Vec shifts this landscape by repurposing large decoder-only LLMs, the current state of the art for most NLP tasks, as text encoders.

LLM2Vec Methodology

LLM2Vec is a simple unsupervised recipe for turning any pre-trained decoder-only LLM into a strong text encoder. The transformation involves three steps: enabling bidirectional attention, applying masked next token prediction (MNTP), and adding unsupervised contrastive learning (SimCSE). Together, these steps remove the restriction imposed by causal attention and let the model produce rich, context-aware representations of whole sequences; a minimal sketch of the three steps follows the list below.

  • Enabling Bidirectional Attention: LLM2Vec first replaces the causal attention mask with a bidirectional one, so that every token can attend to every other token in the sequence.
  • Masked Next Token Prediction (MNTP): To adapt the model to its new bidirectional attention, LLM2Vec trains it with MNTP, which combines masked language modeling with next-token prediction: a fraction of input tokens is masked, and each masked token is predicted from the hidden state at the preceding position. This teaches the model to draw on both past and future context.
  • Unsupervised Contrastive Learning (SimCSE): Finally, SimCSE improves sequence-level embeddings: the same sequence is encoded twice with different dropout masks to form a positive pair, and other sequences in the batch serve as in-batch negatives for a contrastive objective.
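
To make the three steps concrete, here is a minimal PyTorch sketch built around a toy single-layer decoder. It illustrates the ideas only and is not the authors' code: the names (ToyDecoder, mntp_loss, simcse_loss) and all hyperparameters are invented for the example, while real LLM2Vec applies the same objectives to a pre-trained decoder-only LLM.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyDecoder(nn.Module):
        """A stand-in for a decoder-only LLM, small enough to run anywhere."""
        def __init__(self, vocab=1000, d=64, bidirectional=True):
            super().__init__()
            self.bidirectional = bidirectional        # step 1: toggle causal masking
            self.emb = nn.Embedding(vocab, d)
            self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
            self.drop = nn.Dropout(0.1)               # dropout doubles as SimCSE augmentation
            self.lm_head = nn.Linear(d, vocab)

        def hidden(self, ids):
            x = self.drop(self.emb(ids))
            L = ids.size(1)
            mask = None
            if not self.bidirectional:                # causal mask hides future tokens
                mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
            h, _ = self.attn(x, x, x, attn_mask=mask)
            return h                                  # (batch, seq, d)

    def mntp_loss(model, ids, mask_prob=0.2, mask_id=0):
        # Step 2: mask some tokens and predict each masked token from the hidden
        # state at the *previous* position, matching a decoder's next-token head.
        masked = ids.clone()
        is_masked = torch.rand(ids.shape) < mask_prob
        is_masked[:, 0] = False                       # no previous position for token 0
        masked[is_masked] = mask_id
        h = model.hidden(masked)
        logits = model.lm_head(h[:, :-1])             # position i-1 predicts token i
        targets = ids[:, 1:].clone()
        targets[~is_masked[:, 1:]] = -100             # score masked positions only
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=-100)

    def simcse_loss(model, ids, tau=0.05):
        # Step 3: two forward passes with independent dropout give a positive
        # pair; other sequences in the batch act as in-batch negatives.
        z1 = model.hidden(ids).mean(dim=1)            # mean pooling over tokens
        z2 = model.hidden(ids).mean(dim=1)
        sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
        return F.cross_entropy(sim, torch.arange(ids.size(0)))

    model = ToyDecoder(bidirectional=True)            # step 1: bidirectional attention
    ids = torch.randint(1, 1000, (8, 32))             # toy batch of token ids
    print(mntp_loss(model, ids).item(), simcse_loss(model, ids).item())

In the paper, the two objectives are applied sequentially (MNTP first, then unsupervised SimCSE), and adaptation is parameter-efficient (LoRA) rather than full fine-tuning.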

Empirical Validation

The efficacy of LLM2Vec is evaluated across several settings. Applied to popular LLMs including S-LLaMA-1.3B, LLaMA-2-7B, and Mistral-7B, LLM2Vec consistently outperforms encoder-only models on word-level tasks such as chunking, named-entity recognition (NER), and part-of-speech (POS) tagging. On the Massive Text Embedding Benchmark (MTEB), the LLM2Vec-transformed Mistral-7B model achieves state-of-the-art performance among unsupervised models. Combining LLM2Vec with supervised contrastive learning further improves results, establishing a new state of the art on MTEB among models trained only on publicly available data.
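
For MTEB-style evaluation, each piece of text is mapped to a single vector and similarity is computed between vectors. The sketch below shows the standard recipe of mean pooling the final hidden states over non-padding tokens; it uses an off-the-shelf Hugging Face encoder as a placeholder, since loading an actual LLM2Vec checkpoint additionally requires the authors' bidirectional-attention modifications.

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "sentence-transformers/all-MiniLM-L6-v2"   # placeholder, not an LLM2Vec model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()

    def embed(sentences):
        batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state        # (batch, seq, d)
        mask = batch["attention_mask"].unsqueeze(-1)         # zero out padding positions
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

    a, b = embed(["A cat sits on the mat.", "A kitten rests on a rug."])
    print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())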

Analytical Insights

A deeper analysis of LLM2Vec-transformed models shows that they integrate information from future tokens into each token's representation, a critical property for strong sequence representations. Interestingly, Mistral-7B performs surprisingly well when bidirectional attention is simply enabled, even before any adaptation, suggesting that it may have been pre-trained with some form of bidirectional attention. This behavior of Mistral models raises questions about the techniques used in their pre-training.
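
The causal-versus-bidirectional contrast at the heart of this analysis can be reproduced with a toy attention layer. The snippet below is an illustration under stated assumptions (random weights, a single attention layer), not the paper's probing setup: with a causal mask, perturbing a future token cannot change the representation of an earlier position, whereas bidirectional attention lets the change propagate backwards.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d, L = 32, 6
    emb = nn.Embedding(100, d)
    attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def reps(ids, causal):
        x = emb(ids)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1) if causal else None
        out, _ = attn(x, x, x, attn_mask=mask)
        return out

    ids = torch.randint(0, 100, (1, L))
    ids2 = ids.clone()
    ids2[0, -1] = (ids2[0, -1] + 1) % 100        # change only the last (future) token

    with torch.no_grad():
        for causal in (True, False):
            delta = (reps(ids, causal)[0, 0] - reps(ids2, causal)[0, 0]).norm().item()
            print(f"causal={causal}: change at position 0 = {delta:.4f}")
    # Expected: ~0 with the causal mask; non-zero with bidirectional attention.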

Implications and Future Directions

LLM2Vec demonstrates the untapped potential of decoder-only LLMs for text embedding and offers a computationally efficient way to repurpose them as universal text encoders. Its simplicity and effectiveness make it attractive for resource-constrained settings, broadening access to state-of-the-art embedding capabilities. The apparent bidirectional behavior of Mistral models also invites further investigation into how such models were pre-trained, which could inform future LLM pre-training strategies.

Conclusion

LLM2Vec shows that large decoder-only LLMs, once equipped with bidirectional attention and adapted through masked next token prediction and unsupervised contrastive learning, become strong text encoders. Its performance in both unsupervised and supervised settings suggests that decoder-only models can serve as efficient, general-purpose embedding models for real-world NLP applications.

Authors (6)
  1. Parishad BehnamGhader
  2. Vaibhav Adlakha
  3. Marius Mosbach
  4. Dzmitry Bahdanau
  5. Nicolas Chapados
  6. Siva Reddy