
BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search (2408.06643v2)

Published 13 Aug 2024 in cs.IR

Abstract: BM25, a widely-used lexical search algorithm, remains crucial in information retrieval despite the rise of pre-trained and LLMs (PLMs/LLMs). However, it neglects query-document similarity and lacks semantic understanding, limiting its performance. We revisit BM25 and introduce BMX, a novel extension of BM25 incorporating entropy-weighted similarity and semantic enhancement techniques. Extensive experiments demonstrate that BMX consistently outperforms traditional BM25 and surpasses PLM/LLM-based dense retrieval in long-context and real-world retrieval benchmarks. This study bridges the gap between classical lexical search and modern semantic approaches, offering a promising direction for future information retrieval research. The reference implementation of BMX can be found in Baguetter, which was created in the context of this work. The code can be found here: https://github.com/mixedbread-ai/baguetter.

Citations (1)

Summary

  • The paper introduces BMX by enhancing the BM25 algorithm with entropy-weighted similarity, addressing key limitations in lexical search.
  • It employs Weighted Query Augmentation and score normalization to efficiently bridge classical lexical methods with modern semantic retrieval.
  • Experimental evaluations on the BEIR, LoCo, and BRIGHT benchmarks demonstrate that BMX outperforms BM25 and rivals embedding-based models.

The paper "BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search" addresses a long-standing issue in information retrieval (IR) by extending the pivotal BM25 algorithm. BM25, a key lexical search algorithm, is limited in its handling of query-document similarity and lacks semantic understanding, shortcomings this paper tackles through its proposed extensions. The resulting algorithm, BMX, integrates entropy-weighted similarity and introduces semantic enhancement techniques to bridge the gap between classical lexical search and modern semantic IR methods.

Key Innovations and Methodology

BMX introduces three main innovations:

  1. Entropy-weighted Similarity: This component utilizes the entropy of individual query tokens to weight the similarity scores. High-frequency tokens, which may bias the retrieval towards less informative content, are addressed using entropy weighting. This ensures that less frequent but more informative tokens have a greater impact on the relevance scores.
  2. Weighted Query Augmentation (WQA): BMX augments queries with LLM-generated terms to incorporate semantic understanding. This augmentation requires only a single retrieval-and-ranking pass, enhancing efficiency by avoiding multiple retrieval cycles.
  3. Score Normalization: To improve applicability, BMX normalizes its output scores, aiding scenarios that require threshold-based retrieval.
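To make the one-pass augmentation idea concrete, the sketch below merges expansion terms into a query with a discounted weight. This is an illustrative sketch only: the function name, the `alpha` parameter, and the `expansions` mapping (which stands in for an LLM call) are assumptions, not the paper's API or exact weighting scheme.

```python
def augment_query(query_tokens, expansions, alpha=0.5):
    """Weighted Query Augmentation sketch: original tokens keep weight 1.0,
    while suggested expansion terms enter with a discounted weight `alpha`.
    `expansions` is a stand-in for an LLM call; the paper's exact weighting
    scheme may differ."""
    weighted = {t: 1.0 for t in query_tokens}
    for t in query_tokens:
        for e in expansions.get(t, []):
            # Never let an expansion term override an original query token.
            if e not in weighted:
                weighted[e] = alpha
    return weighted
```

Because the weights are attached to tokens up front, a single scoring pass over the index suffices; no second retrieval round is needed to fold in the semantic expansions.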

The core BMX algorithm is mathematically defined to incorporate these enhancements, aligning with traditional BM25's framework but modifying it to account for token entropy and augmented queries. The normalization procedure ensures the scores fall within a predefined range, facilitating practical implementations.
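The following is a minimal Python sketch of how entropy weighting and score normalization might slot into a BM25-style scorer. It is illustrative only: the exact BMX formula in the paper differs, and the specific weighting choice here (downweighting tokens whose occurrences spread evenly across the corpus) is an assumption based on the description above.

```python
import math
from collections import Counter

def bm25_entropy_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score tokenized documents with a BM25-style formula whose per-token
    contribution is scaled by an entropy-derived weight, then min-max
    normalize the scores. Illustrative sketch, not the paper's formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))

    def idf(t):
        # Standard smoothed BM25 inverse document frequency.
        return math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))

    def entropy(t):
        # Entropy of the token's occurrence distribution over documents;
        # tokens spread evenly across the corpus have high entropy.
        counts = [d.count(t) for d in docs]
        total = sum(counts)
        if total == 0:
            return 0.0
        probs = [c / total for c in counts if c > 0]
        return -sum(p * math.log(p) for p in probs)

    max_h = math.log(N) if N > 1 else 1.0  # maximum entropy over N docs
    scores = []
    for d in docs:
        s = 0.0
        dl = len(d)
        for t in query_tokens:
            tf = d.count(t)
            if tf == 0:
                continue
            # Assumption: downweight evenly-spread (less informative) tokens.
            w = 1.0 - entropy(t) / max_h
            s += w * idf(t) * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    # Min-max normalize into [0, 1] so score thresholds transfer
    # across queries, as the paper's normalization step intends.
    lo, hi = min(scores), max(scores)
    rng = (hi - lo) or 1.0
    return [(s - lo) / rng for s in scores]
```

Normalizing into [0, 1] is what enables the threshold-based retrieval scenarios mentioned above: a cutoff such as 0.5 means the same thing for every query.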

Experimental Evaluation

The effectiveness of BMX is demonstrated through extensive experiments on established IR benchmarks: BEIR, LoCo, and BRIGHT. The detailed evaluations are summarized as follows:

BEIR Benchmark

BMX was rigorously tested against both BM25 and embedding-based retrieval models. Numerical results showed that BMX consistently outperformed BM25 across various datasets within the BEIR benchmark. The paper found that when paired with WQA, BMX rivaled and sometimes surpassed state-of-the-art embedding-based models, demonstrating its robust handling of semantic nuances.

LoCo Benchmark

For long-context retrieval tasks, BMX excelled, outperforming both lexical and embedding-based models on most datasets, thereby validating its efficacy in scenarios with extended text contexts. This showcased BMX's ability to maintain relevance in documents with a broader scope of content.

BRIGHT Benchmark

In an evaluation involving diverse domains, BMX with WQA outperformed other lexical models and competitive embedding models. This finding emphasizes BMX's robustness across heterogeneous datasets, enhancing its utility for real-world IR challenges.

Practical and Theoretical Implications

From a practical perspective, BMX provides a significant improvement in retrieval quality without the extensive computational overhead of embedding-based models. Its incorporation of entropy-weighted similarity and semantic enhancements offers a balanced approach between lexical efficiency and semantic richness.

Theoretical continuations of this work may explore deeper integrations of various similarity metrics and enhanced augmentation strategies. Future developments in AI could further refine these techniques, potentially incorporating more sophisticated entropy measures or context-aware augmentation methods. BMX paves the way for subsequent IR models that balance semantic depth with lexical retrieval efficiency, contributing valuable insights to the evolving understanding of information retrieval mechanisms.

Conclusion

In conclusion, the paper presents BMX as a formidable evolution of the BM25 algorithm, tailored to modern information retrieval needs. By addressing the limitations in query-document similarity and semantic understanding, BMX positions itself as a robust alternative capable of bridging classical and contemporary IR approaches. The research provides a foundational step forward, with future work potentially expanding these enhancements to further nuanced and computationally demanding retrieval tasks.