Enhancing Lexicon-Based Text Embeddings with LLMs
The paper "Enhancing Lexicon-Based Text Embeddings with LLMs" presents a novel framework titled LENS, which seeks to leverage the capabilities of LLMs to produce lexicon-based embeddings. Traditional dense embeddings have predominantly been the focus of text representation research. However, this work argues for reconsidering lexicon-based embeddings, which offer certain advantages, such as better alignment with LLM pre-training objectives and enhanced interpretability. Despite their theoretical benefits, these embeddings are often sidelined due to the tokenization redundancy and unidirectional attention limitations associated with LLMs. This paper addresses these challenges by introducing LENS, which clusters token embeddings and utilizes bidirectional attention, potentially transforming LLMs into powerful tools for generating lexicon-based embeddings.
Methodology
LENS first addresses the inefficiency of LLM tokenizers, which often produce redundant subword tokens. The framework clusters token embeddings to group semantically similar tokens, then replaces the original token embeddings in the language modeling head with the cluster representations, significantly reducing the dimensionality and redundancy of the resulting lexicon-based embeddings. LENS also modifies the LLM to use bidirectional attention, allowing every token to draw on its full context. This matters for lexicon-based embeddings, which derive meaning from the outputs of all tokens rather than from the final token alone. A minimal sketch of the clustering step is given below.
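The following is a rough illustration of the clustering idea, not the authors' implementation: it assumes hypothetical names (cluster_lm_head, lm_head_weight, num_clusters) and uses KMeans as one possible clustering method, replacing the vocabulary-sized projection with cluster centroids so that each output dimension scores a group of related tokens.

```python
# Sketch only: cluster the LM head's token embeddings and use the centroids
# as a reduced projection matrix (assumed names; KMeans is one possible choice).
import torch
from sklearn.cluster import KMeans

def cluster_lm_head(lm_head_weight: torch.Tensor, num_clusters: int = 8000) -> torch.Tensor:
    """Group vocabulary embeddings (rows of the LM head) into semantic clusters
    and return the cluster centroids, shape (num_clusters, hidden_dim)."""
    kmeans = KMeans(n_clusters=num_clusters, random_state=0)
    kmeans.fit(lm_head_weight.detach().cpu().float().numpy())
    return torch.as_tensor(kmeans.cluster_centers_, dtype=lm_head_weight.dtype)

# Usage sketch: project hidden states onto cluster centroids instead of the full
# vocabulary, so each output dimension corresponds to a cluster of related tokens.
# cluster_head = cluster_lm_head(model.lm_head.weight, num_clusters=8000)
# cluster_logits = hidden_states @ cluster_head.T   # (batch, seq, num_clusters)
```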
To generate embeddings, LENS feeds query-passage pairs formatted with task instructions into the LLM, following a structured input design similar to BGE-en-ICL. The model is trained on a mixture of data spanning retrieval, classification, and clustering tasks, so the embeddings remain versatile across applications. Embeddings are produced by applying a log-saturation function to the output activations and max-pooling over tokens, yielding a compact representation with far fewer dimensions than traditional lexicon-based embeddings; a sketch of this pooling step follows.
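As a minimal sketch of the pooling described above (the exact formulation in the paper may differ), this assumes per-token cluster activations and a padding mask, applies log-saturation via log(1 + ReLU(x)), and max-pools over the sequence:

```python
# Sketch only: turn per-token cluster activations into one lexicon embedding
# per sequence (assumed function and variable names).
import torch

def lexicon_embedding(cluster_logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """cluster_logits: (batch, seq, num_clusters); attention_mask: (batch, seq)."""
    # Log-saturation: keep only positive activations and dampen large values.
    weights = torch.log1p(torch.relu(cluster_logits))
    # Zero out padding positions before pooling.
    weights = weights.masked_fill(attention_mask.unsqueeze(-1) == 0, 0.0)
    # Max-pool over the sequence: each dimension keeps its strongest activation.
    return weights.max(dim=1).values  # (batch, num_clusters)

# Query-passage similarity can then be computed, e.g., as a dot product of embeddings.
```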
Results
The experimental results show that LENS outperforms many state-of-the-art dense embedding models across a wide array of tasks in the Massive Text Embedding Benchmark (MTEB). Notably, LENS-8000 not only surpasses its dense counterpart built on the same backbone (Mistral-7B) but also ranks among the top models trained on fully public data. LENS further achieves state-of-the-art results on zero-shot tasks among models trained on public data, outperforming dense embeddings on retrieval benchmarks such as BEIR.
Implications and Future Directions
The introduction of LENS represents a significant step toward exploiting LLM architectures to improve lexicon-based embeddings. The implications for retrieval and semantic understanding tasks are particularly notable, since LENS provides a compact yet powerful alternative to dense embeddings. Because each dimension corresponds to a cluster of related tokens, the embeddings are also more transparent, opening new opportunities for model interpretability and explainability in AI systems.
Future research could further explore the integration of LENS with other embedding techniques or adaptations to multilingual contexts, expanding its utility. Additionally, refining the clustering techniques and examining different attention mechanisms could yield further improvements in performance, allowing lexicon-based embeddings to reach their full potential in diverse applications within AI and beyond.