Enhancing Lexicon-Based Text Embeddings with Large Language Models (2501.09749v1)

Published 16 Jan 2025 in cs.CL and cs.IR

Abstract: Recent LLMs have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional-attention limitation of traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of their dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e. BEIR).

Enhancing Lexicon-Based Text Embeddings with LLMs

The paper "Enhancing Lexicon-Based Text Embeddings with LLMs" presents a novel framework titled LENS, which seeks to leverage the capabilities of LLMs to produce lexicon-based embeddings. Traditional dense embeddings have predominantly been the focus of text representation research. However, this work argues for reconsidering lexicon-based embeddings, which offer certain advantages, such as better alignment with LLM pre-training objectives and enhanced interpretability. Despite their theoretical benefits, these embeddings are often sidelined due to the tokenization redundancy and unidirectional attention limitations associated with LLMs. This paper addresses these challenges by introducing LENS, which clusters token embeddings and utilizes bidirectional attention, potentially transforming LLMs into powerful tools for generating lexicon-based embeddings.
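To make the unidirectional-attention limitation concrete, the following minimal PyTorch sketch contrasts the additive attention mask of a causal LLM with its bidirectional counterpart. This illustrates the mask change only, not the authors' implementation; the `attention_mask` helper is hypothetical.

```python
import torch

def attention_mask(seq_len: int, bidirectional: bool) -> torch.Tensor:
    """Additive attention mask: 0 where attention is allowed, -inf where blocked.

    Causal LLMs block each token from attending to later positions; switching
    to a bidirectional mask (all zeros) lets every token see the full context.
    """
    if bidirectional:
        return torch.zeros(seq_len, seq_len)
    # In the causal case, future positions (strict upper triangle) are masked out.
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)

print(attention_mask(4, bidirectional=False))  # lower-triangular visibility
print(attention_mask(4, bidirectional=True))   # full visibility
```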

Methodology

LENS operates by first addressing the inefficiencies of LLM tokenizers, which often split words into redundant subword tokens. The framework uses token embedding clustering to group semantically similar tokens, then substitutes the cluster centroids for the original token embeddings in the language-modeling head, significantly reducing the dimensionality and redundancy of the resulting lexicon-based embeddings. Furthermore, LENS replaces the causal attention of the LLM with bidirectional attention, allowing every token to attend to its full context. This is critical for lexicon-based embeddings, which derive meaning from the outputs of all tokens rather than from the final token alone.
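As an illustration of the consolidation step, the sketch below clusters a token-embedding matrix with k-means and keeps one centroid per cluster as the new output head. Sizes are toy values (the paper works at the scale of Mistral-7B's vocabulary, with e.g. 8000 clusters for LENS-8000), and the use of scikit-learn's KMeans is an assumption, not the paper's exact procedure.

```python
import torch
from sklearn.cluster import KMeans

# Toy sizes for illustration; a real run would use the LLM's full vocabulary
# (~32k tokens for Mistral-7B) and a few thousand clusters.
vocab_size, hidden_dim, num_clusters = 1000, 64, 100

# Stand-in for the LLM's token-embedding matrix (normally loaded from the model).
token_embeddings = torch.randn(vocab_size, hidden_dim)

# Group semantically similar tokens: each token is assigned to one cluster.
kmeans = KMeans(n_clusters=num_clusters, n_init=1, random_state=0)
cluster_ids = torch.as_tensor(kmeans.fit_predict(token_embeddings.numpy()))

# Consolidated head: one centroid per cluster replaces the per-token rows, so
# the output layer scores num_clusters dimensions instead of vocab_size,
# removing subword redundancy.
centroids = torch.as_tensor(kmeans.cluster_centers_, dtype=torch.float32)
print(cluster_ids.shape, centroids.shape)  # (1000,) (100, 64)
```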

To generate these embeddings, LENS feeds instruction-formatted query-passage pairs into the LLM, following a structured input design similar to BGE-en-ICL. The model is trained on a mixture of data spanning retrieval, classification, and clustering tasks, ensuring the embeddings remain versatile across applications. The embeddings themselves are produced by applying a log-saturation activation followed by max-pooling over the token outputs, yielding a compact representation with far fewer dimensions than traditional lexicon-based embeddings.
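A minimal sketch of this pooling stage follows, under the assumption that the consolidated head scores each token's hidden state against the cluster centroids. The exact saturation function is assumed to be the log1p-of-ReLU form common in learned sparse retrieval (e.g., SPLADE), and `lens_embedding` is a hypothetical helper.

```python
import torch

def lens_embedding(hidden_states: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Sketch of LENS-style pooling, assuming the consolidated cluster head.

    hidden_states: (seq_len, hidden_dim) token outputs from the (bidirectional) LLM.
    centroids:     (num_clusters, hidden_dim) consolidated head from clustering.
    Returns a (num_clusters,) lexicon-style embedding.
    """
    # Score each token against each cluster via the consolidated head.
    logits = hidden_states @ centroids.T            # (seq_len, num_clusters)
    # Log-saturation dampens large activations so no single term dominates.
    saturated = torch.log1p(torch.relu(logits))
    # Max-pool over the sequence: each dimension keeps its strongest token.
    return saturated.max(dim=0).values

emb = lens_embedding(torch.randn(12, 64), torch.randn(100, 64))
print(emb.shape)  # torch.Size([100])
```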

Results

The experimental results demonstrate that LENS outperforms many state-of-the-art dense embedding models across a wide array of tasks in the Massive Text Embedding Benchmark (MTEB). Notably, LENS-8000 not only surpasses its dense counterpart built on the same backbone (Mistral-7B) but also ranks among the strongest models trained exclusively on public data. LENS likewise achieves state-of-the-art zero-shot results among publicly trained models, outperforming dense embeddings on retrieval benchmarks such as BEIR.

Implications and Future Directions

The introduction of LENS represents a significant step forward in exploiting the architectures of LLMs to produce lexicon-based embeddings. The implications for retrieval and semantic understanding tasks are particularly notable, as LENS provides a compact yet powerful alternative to dense embeddings. Because each embedding dimension corresponds to a human-readable token cluster, lexicon-based embeddings also open new opportunities for model interpretability and explainability in AI systems.

Future research could further explore the integration of LENS with other embedding techniques or adaptations to multilingual contexts, expanding its utility. Additionally, refining the clustering techniques and examining different attention mechanisms could yield further improvements in performance, allowing lexicon-based embeddings to reach their full potential in diverse applications within AI and beyond.

Authors (4)
  1. Yibin Lei (9 papers)
  2. Tao Shen (87 papers)
  3. Yu Cao (129 papers)
  4. Andrew Yates (59 papers)