LexBoost Retrieval Enhancement
- LexBoost is a sparse retrieval enhancement technique that integrates lexical matching with semantic neighborhood propagation using an offline dense document neighbor graph to overcome vocabulary mismatch.
- It improves retrieval precision and recall by combining BM25's speed with enriched semantic signals from precomputed document neighbors via a tunable fusion parameter.
- Empirical evaluations on TREC and CORD-19 datasets show significant gains in MAP (+7.1% to +16.8%) with minimal query-time overhead.
LexBoost refers to a sparse retrieval enhancement technique that fuses lexical document matching with semantic neighborhood propagation by leveraging an offline-constructed dense document neighbor graph. In the information retrieval context, LexBoost is designed to address the limitations of traditional sparse retrievers based on token overlap—typified by models such as BM25—by incorporating the Cluster Hypothesis: documents that are similar in semantic space are more likely to be mutually relevant to a given query. LexBoost executes this fusion strategy with minimal online cost, combining the speed and scalability of inverted-index-based retrieval with a form of semantic signal produced through dense vector-based analysis, without requiring computationally intensive reranking at query time. Empirically, it produces statistically significant gains in precision and recall relative to lexical baselines, with overhead approaching that of standard BM25 (Kulkarni et al., 2024).
LexBoost is distinct from, but often confused with, “LexiBoost”—a boosting framework for class-imbalanced machine learning based on lexicographic programming (Datta et al., 2017). Unless specifically discussing imbalanced classification via LP-boosting, LexBoost refers to the retrieval-centric method described above.
1. Motivation and Theoretical Foundation
Traditional sparse retrievers such as BM25 exploit token intersection between queries and corpus documents, enabling sub-10 ms/query response times with inverted indices, but are susceptible to vocabulary mismatch—substantially reducing recall and precision when queries and relevant documents use semantically equivalent but disjoint surface forms. Dense retrieval methods, notably those using dual-encoder architectures based on models like BERT, project queries and documents into a shared vector space and compare by cosine or inner product similarity, capturing semantic relations missed by lexical overlap. However, dense retrieval is bottlenecked by nearest-neighbor search, and even the fastest approximate methods (e.g., HNSW, IVF) incur latencies one to two orders of magnitude above pure lexical retrieval.
LexBoost’s key insight is the offline exploitation of the Cluster Hypothesis, which posits that documents close in semantic space tend to share query relevance. By constructing a dense neighbor graph offline and, at query time, augmenting each candidate document’s lexical score with evidence from its k nearest neighbors, LexBoost provides a principled mechanism for integrating semantic evidence into a sparse search pipeline with negligible query-time slow-down (Kulkarni et al., 2024).
2. Corpus Graph Construction
LexBoost requires offline computation of a k-nearest-neighbor (k-NN) graph over the document set using dense encodings. Given a corpus D and a dense encoder E, each document d ∈ D is mapped to an embedding e_d = E(d). For each d, the k most similar documents are identified by cosine similarity, creating directed edges to establish the document neighbor graph G.
The computational procedure is:
- For each document d ∈ D, compute the embedding e_d = E(d).
- For each d, identify N_k(d), the top-k documents ranked by cosine similarity to e_d (excluding d itself).
- Record directed edges (d, d′) for each d′ ∈ N_k(d) in G.
This process has O(n²) time complexity for a corpus of n documents (due to all-pairs similarity), with O(nk) space to store the resulting graph. The graph is constructed once at indexing time and is fixed for retrieval. Implementations in the original evaluation considered both TCT-ColBERT-HNP and TAS-B as dense models for graph construction, demonstrating agnosticism to the underlying transformer encoder (Kulkarni et al., 2024).
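The offline construction above can be sketched in a few lines of numpy. Here the embeddings are assumed to be produced by some dense encoder (the evaluation used TCT-ColBERT-HNP or TAS-B) and to be L2-normalized, so cosine similarity reduces to a dot product; this is an illustrative exact all-pairs version, not the authors' implementation.

```python
import numpy as np

def build_neighbor_graph(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Build a k-NN document graph from L2-normalized dense embeddings.

    Returns an (n, k) array: row i holds the ids of the k documents
    most similar to document i (excluding i itself), by cosine similarity.
    """
    # For unit vectors, cosine similarity is just the dot product.
    sims = embeddings @ embeddings.T           # (n, n) all-pairs similarity
    np.fill_diagonal(sims, -np.inf)            # a document is not its own neighbor
    # Sort each row in descending similarity and keep the top-k columns.
    return np.argsort(-sims, axis=1)[:, :k]

# Toy corpus: 5 documents in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows
graph = build_neighbor_graph(emb, k=2)
print(graph.shape)  # (5, 2): two neighbor ids per document
```

At real corpus scale the O(n²) similarity matrix would be computed in blocks or replaced by an approximate index, but the fixed, query-independent nature of the graph is the same.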
3. Query-Time Fusion and Ranking Algorithm
At retrieval time, for a given query q, the BM25 (or similar lexical) scores are first computed for an initial candidate set (typically the top-n documents by lexical score). For each candidate d, LexBoost retrieves its precomputed list of k nearest neighbors N_k(d) from G and obtains their lexical scores (set to zero for neighbors outside the candidate set). The final LexBoost score is then computed as:

S_LexBoost(q, d) = α · S_lex(q, d) + (1 − α) · (1/k) · Σ_{d′ ∈ N_k(d)} S_lex(q, d′)

Here, α ∈ [0, 1] is a tunable parameter balancing local (document) and neighborhood lexical evidence; empirical results indicate a broad plateau of near-optimal values around the reported α = 0.7.
The augmented candidate set is then re-ranked by S_LexBoost, selecting the top results for the final response. The main runtime cost over vanilla BM25 is k additional score lookups per candidate and the corresponding arithmetic, resulting in practical online overhead of less than 1 ms per query, or under 10% over baseline (Kulkarni et al., 2024).
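The query-time step amounts to a dictionary lookup and a weighted average. The sketch below assumes the lexical scores for the candidate set and the offline neighbor graph are already in hand; `lex_scores` maps document id to BM25 score (ids absent from the candidate set score zero), and the neighbor contribution is averaged as in the fusion formula above.

```python
def lexboost_score(doc_id, lex_scores, graph, alpha=0.7):
    """Fuse a document's own lexical score with the mean lexical score
    of its precomputed dense neighbors (zero for neighbors outside the
    candidate set)."""
    own = lex_scores.get(doc_id, 0.0)
    neighbors = graph.get(doc_id, [])
    if not neighbors:
        return own
    neighborhood = sum(lex_scores.get(n, 0.0) for n in neighbors) / len(neighbors)
    return alpha * own + (1 - alpha) * neighborhood

# Toy example: doc 1 is a weak lexical match on its own, but its dense
# neighbors (docs 2 and 3) match the query well, so its score is lifted.
lex_scores = {1: 1.0, 2: 4.0, 3: 6.0}
graph = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
reranked = sorted(lex_scores, key=lambda d: lexboost_score(d, lex_scores, graph),
                  reverse=True)
print(reranked)  # → [3, 2, 1]
```

With α = 0.7, doc 1's fused score is 0.7 · 1.0 + 0.3 · 5.0 = 2.2, more than triple its raw lexical score: the neighborhood evidence is what lets semantically related documents surface despite vocabulary mismatch.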
4. Efficiency Characteristics
A summary of efficiency and scalability:
| Component | Latency per Query | Remarks |
|---|---|---|
| BM25 (baseline) | ~10 ms | Single CPU, inverted index |
| LexBoost | ~11 ms | k score lookups per candidate; roughly +1 ms over BM25 |
| HNSW (approx. dense retrieval) | 50–100 ms | Orders of magnitude slower than BM25 |
| Exhaustive dense retrieval | >200 ms | Not practical at scale |
LexBoost maintains near-BM25 speed even on large corpora, in sharp contrast to standard and approximate dense methods that require online embedding computation or traversal of complex data structures (Kulkarni et al., 2024).
5. Experimental Validation
LexBoost was evaluated on TREC 2019/2020 Deep Learning (MS MARCO Passage) and TREC-COVID (CORD-19) datasets. Baselines include BM25, DFR-PL2, DFR-DPH, QLD, approximate dense (HNSW), and BM25→dense reranking. Key evaluation metrics are MAP, nDCG@10/100/1000, and Recall@1000 (with relevance threshold 2).
Representative results:
| Dataset | Baseline (MAP) | LexBoost (k=16, α=0.7) | Relative Gain |
|---|---|---|---|
| TREC DL 2019 | 0.3877 | 0.4150 | +7.1% |
| TREC DL 2020 | 0.3609 | 0.3897 | +7.9% |
| TREC-COVID | 0.2525 | 0.2950 | +16.8% |
Improvements are statistically significant under a paired t-test. Augmenting LexBoost with transformer-based rerankers yields performance rivaling exhaustive dense reranking at a fraction of the latency.
6. Sensitivity, Robustness, and Design Variants
Performance is robust to both the number of neighbors k and the fusion weight α. Gains increase steadily with k up to 16, then plateau. Heat-maps of MAP and nDCG over (k, α) reveal broad parameter stability. The effectiveness is not tied to a particular choice of dense model for the corpus graph: TCT-ColBERT-HNP and TAS-B both yield similarly strong gains, and the method is insensitive to nuances in the fusion hyperparameters (Kulkarni et al., 2024).
7. Extensions and Future Directions
Opportunities for extending LexBoost include:
- Cross-lingual and multilingual retrieval, where dense neighbor links may traverse language boundaries.
- Dynamic per-query tuning of the fusion parameter α, potentially as a function of query characteristics.
- Exploiting user history and context by incorporating query nodes or user session context into the corpus graph.
- More efficient graph construction using approximate nearest neighbor indices and/or hardware acceleration (e.g., GPU).
- Extension to other retrieval paradigms (e.g., table or passage-level search) by modifying the neighborhood construction schema.
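As one hedged illustration of the efficiency direction above, the all-pairs graph construction can be approximated by bucketing documents with random-hyperplane signatures (a simple locality-sensitive hashing scheme) and computing exact similarities only within each bucket; a production system would more likely use an off-the-shelf ANN library such as FAISS or an HNSW index, so the function and parameters below are purely illustrative.

```python
import numpy as np
from collections import defaultdict

def approx_neighbor_graph(emb: np.ndarray, k: int, n_planes: int = 4, seed: int = 0):
    """Approximate k-NN graph via random-hyperplane LSH.

    Documents sharing a bit signature land in one bucket, and exact
    cosine similarity is computed only inside each bucket, replacing
    the O(n^2) all-pairs pass with a sum of small per-bucket passes.
    """
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(emb.shape[1], n_planes))
    sigs = emb @ planes > 0                    # (n, n_planes) bit signatures
    buckets = defaultdict(list)
    for i, s in enumerate(sigs):
        buckets[s.tobytes()].append(i)
    graph = {}
    for ids in buckets.values():
        sub = emb[ids]
        sims = sub @ sub.T                     # exact similarity inside bucket
        np.fill_diagonal(sims, -np.inf)        # exclude self-edges
        for row, i in enumerate(ids):
            order = np.argsort(-sims[row])[:k]
            graph[i] = [ids[j] for j in order if np.isfinite(sims[row][j])]
    return graph
```

The trade-off is recall of true neighbors versus construction time; since LexBoost's gains plateau by k = 16 and tolerate noisy neighborhoods, a moderately lossy graph may suffice in practice.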
A plausible implication is that LexBoost’s near-zero query-time cost may enable efficient fusion-based retrieval in resource-constrained or real-time applications, which is difficult for reranking- or embedding-based dense pipelines (Kulkarni et al., 2024).