
BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search (2408.06643v2)

Published 13 Aug 2024 in cs.IR

Abstract: BM25, a widely-used lexical search algorithm, remains crucial in information retrieval despite the rise of pre-trained and LLMs (PLMs/LLMs). However, it neglects query-document similarity and lacks semantic understanding, limiting its performance. We revisit BM25 and introduce BMX, a novel extension of BM25 incorporating entropy-weighted similarity and semantic enhancement techniques. Extensive experiments demonstrate that BMX consistently outperforms traditional BM25 and surpasses PLM/LLM-based dense retrieval in long-context and real-world retrieval benchmarks. This study bridges the gap between classical lexical search and modern semantic approaches, offering a promising direction for future information retrieval research. The reference implementation of BMX can be found in Baguetter, which was created in the context of this work. The code can be found here: https://github.com/mixedbread-ai/baguetter.

Citations (1)

Summary

  • The paper introduces BMX by enhancing the BM25 algorithm with entropy-weighted similarity, addressing key limitations in lexical search.
  • It employs Weighted Query Augmentation and score normalization to efficiently bridge classical lexical methods with modern semantic retrieval.
  • Experimental evaluations on the BEIR, LoCo, and BRIGHT benchmarks demonstrate that BMX outperforms BM25 and rivals embedding-based models.

The paper "BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search" addresses a long-standing issue in information retrieval (IR) by extending the pivotal BM25 algorithm. BM25, a key lexical search algorithm, is limited in its handling of query-document similarity and lacks semantic understanding, shortcomings this paper tackles through its proposed extensions. The resulting algorithm, BMX, integrates entropy-weighted similarity and introduces semantic enhancement techniques to bridge the gap between classical lexical search and modern semantic IR methods.

Key Innovations and Methodology

BMX introduces three main innovations:

  1. Entropy-weighted Similarity: This component utilizes the entropy of individual query tokens to weight the similarity scores. High-frequency tokens, which may bias the retrieval towards less informative content, are addressed using entropy weighting. This ensures that less frequent but more informative tokens have a greater impact on the relevance scores.
  2. Weighted Query Augmentation (WQA): BMX augments queries with LLM-generated terms to incorporate semantic understanding. This augmentation requires only a single retrieval-and-ranking pass, enhancing efficiency by avoiding multiple retrieval cycles.
  3. Score Normalization: To improve applicability, BMX normalizes its output scores, aiding scenarios that require threshold-based retrieval.
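To make the one-pass augmentation idea concrete, the sketch below merges expansion terms into a query with a discounted weight. This is an illustrative sketch only: the function name, the `alpha` parameter, and the `expansions` mapping (which stands in for an LLM call) are assumptions, not the paper's API or exact weighting scheme.

```python
def augment_query(query_tokens, expansions, alpha=0.5):
    """Weighted Query Augmentation sketch: original tokens keep weight 1.0,
    while suggested expansion terms enter with a discounted weight `alpha`.
    `expansions` is a stand-in for an LLM call; the paper's exact weighting
    scheme may differ."""
    weighted = {t: 1.0 for t in query_tokens}
    for t in query_tokens:
        for e in expansions.get(t, []):
            # Never let an expansion term override an original query token.
            if e not in weighted:
                weighted[e] = alpha
    return weighted
```

Because the weights are attached to tokens up front, a single scoring pass over the index suffices; no second retrieval round is needed to fold in the semantic expansions.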

The core BMX algorithm is mathematically defined to incorporate these enhancements, aligning with traditional BM25's framework but modifying it to account for token entropy and augmented queries. The normalization procedure ensures the scores fall within a predefined range, facilitating practical implementations.
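The following is a minimal Python sketch of how entropy weighting and score normalization might slot into a BM25-style scorer. It is illustrative only: the exact BMX formula in the paper differs, and the specific weighting choice here (downweighting tokens whose occurrences spread evenly across the corpus) is an assumption based on the description above.

```python
import math
from collections import Counter

def bm25_entropy_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score tokenized documents with a BM25-style formula whose per-token
    contribution is scaled by an entropy-derived weight, then min-max
    normalize the scores. Illustrative sketch, not the paper's formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))

    def idf(t):
        # Standard smoothed BM25 inverse document frequency.
        return math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))

    def entropy(t):
        # Entropy of the token's occurrence distribution over documents;
        # tokens spread evenly across the corpus have high entropy.
        counts = [d.count(t) for d in docs]
        total = sum(counts)
        if total == 0:
            return 0.0
        probs = [c / total for c in counts if c > 0]
        return -sum(p * math.log(p) for p in probs)

    max_h = math.log(N) if N > 1 else 1.0  # maximum entropy over N docs
    scores = []
    for d in docs:
        s = 0.0
        dl = len(d)
        for t in query_tokens:
            tf = d.count(t)
            if tf == 0:
                continue
            # Assumption: downweight evenly-spread (less informative) tokens.
            w = 1.0 - entropy(t) / max_h
            s += w * idf(t) * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    # Min-max normalize into [0, 1] so score thresholds transfer
    # across queries, as the paper's normalization step intends.
    lo, hi = min(scores), max(scores)
    rng = (hi - lo) or 1.0
    return [(s - lo) / rng for s in scores]
```

Normalizing into [0, 1] is what enables the threshold-based retrieval scenarios mentioned above: a cutoff such as 0.5 means the same thing for every query.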

Experimental Evaluation

The effectiveness of BMX is demonstrated through extensive experiments on established IR benchmarks: BEIR, LoCo, and BRIGHT. The detailed evaluations are summarized as follows:

BEIR Benchmark

BMX was rigorously tested against both BM25 and embedding-based retrieval models. Numerical results showed that BMX consistently outperformed BM25 across various datasets within the BEIR benchmark. The paper found that when paired with WQA, BMX rivaled and sometimes surpassed state-of-the-art embedding-based models, demonstrating its robust handling of semantic nuances.

LoCo Benchmark

For long-context retrieval tasks, BMX excelled, outperforming both lexical and embedding-based models on most datasets, thereby validating its efficacy in scenarios with extended text contexts. This showcased BMX's ability to maintain relevance in documents with a broader scope of content.

BRIGHT Benchmark

In an evaluation involving diverse domains, BMX with WQA outperformed other lexical models and competitive embedding models. This finding emphasizes BMX's robustness across heterogeneous datasets, enhancing its utility for real-world IR challenges.

Practical and Theoretical Implications

From a practical perspective, BMX provides a significant improvement in retrieval quality without the extensive computational overhead of embedding-based models. Its incorporation of entropy-weighted similarity and semantic enhancements offers a balanced approach between lexical efficiency and semantic richness.

Theoretical continuations of this work may explore deeper integrations of various similarity metrics and enhanced augmentation strategies. Future developments in AI could further refine these techniques, potentially incorporating more sophisticated entropy measures or context-aware augmentation methods. BMX paves the way for subsequent IR models that balance semantic depth with lexical retrieval efficiency, contributing valuable insights to the evolving understanding of information retrieval mechanisms.

Conclusion

In conclusion, the paper presents BMX as a formidable evolution of the BM25 algorithm, tailored to modern information retrieval needs. By addressing the limitations in query-document similarity and semantic understanding, BMX positions itself as a robust alternative capable of bridging classical and contemporary IR approaches. The research provides a foundational step forward, with future work potentially expanding these enhancements to further nuanced and computationally demanding retrieval tasks.