
DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities (2410.07722v2)

Published 10 Oct 2024 in cs.IR

Abstract: Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, which often split entities into nonsensical fragments. Splitting entities can reduce retrieval accuracy and limit the model's ability to incorporate up-to-date world knowledge not included in the training data. In this work, we enhance the LSR vocabulary with Wikipedia concepts and entities, enabling the model to resolve ambiguities more effectively and stay current with evolving knowledge. Central to our approach is a Dynamic Vocabulary (DyVo) head, which leverages existing entity embeddings and an entity retrieval component that identifies entities relevant to a query or document. We use the DyVo head to generate entity weights, which are then merged with word piece weights to create joint representations for efficient indexing and retrieval using an inverted index. In experiments across three entity-rich document ranking datasets, the resulting DyVo model substantially outperforms state-of-the-art baselines.

Summary

  • The paper introduces a Dynamic Vocabulary head that integrates Wikipedia entities with traditional word pieces to enhance learned sparse retrieval.
  • It employs a generative entity retrieval mechanism using LLMs to boost precision and recall compared to conventional entity linking methods.
  • The model achieves significant performance gains on benchmark datasets like TREC, demonstrating its effectiveness for entity-rich queries.

Essay: Dynamic Vocabularies for Enhanced Learned Sparse Retrieval

This paper introduces DyVo, a novel approach designed to enhance Learned Sparse Retrieval (LSR) by incorporating dynamic vocabularies enriched with entities. LSR models traditionally face challenges with vocabularies derived from pre-trained transformers, which often fragment crucial entities into nonsensical components, thereby limiting retrieval accuracy and the model's integration of current world knowledge. The DyVo model effectively addresses these limitations through the integration of Wikipedia concepts, enabling improved ambiguity resolution and knowledge updates.

Methodology and Contributions

The core innovation is the Dynamic Vocabulary (DyVo) head, which merges vocabulary from traditional word pieces with a vast array of Wikipedia entities. This integration is facilitated by existing entity embeddings and a candidate retrieval mechanism that identifies relevant entities in the context of a given query or document.

The DyVo head uses these entities to generate entity-specific weights, which are then combined with traditional word piece weights. This fusion creates joint representations that can be indexed and retrieved efficiently with a standard inverted index. The model's effectiveness is demonstrated across multiple entity-rich document ranking datasets, where it achieves significant improvements over state-of-the-art baselines, raising both nDCG and recall and thus improving precision and coverage on entity-heavy queries.
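To make the weight computation concrete, here is a minimal sketch of a DyVo-style head. The function names, pooling choice (max over tokens with a rectified log transform, common in learned sparse retrieval heads), and the convention of offsetting entity ids past the word-piece vocabulary are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def dyvo_head(hidden, wordpiece_emb, entity_emb, candidate_ids):
    """Sketch of a DyVo-style head (hypothetical API, not the paper's code).

    hidden:        (seq_len, d) contextual token representations
    wordpiece_emb: (V, d) word-piece output embeddings
    entity_emb:    (E, d) pre-existing entity embeddings (e.g. Wikipedia)
    candidate_ids: ids of entities retrieved as relevant to this text
    """
    # Word-piece weights: token-level logits, max-pooled over the
    # sequence and rectified (a common LSR weighting scheme).
    wp_logits = hidden @ wordpiece_emb.T                            # (seq_len, V)
    wp_weights = np.log1p(np.maximum(wp_logits.max(axis=0), 0.0))   # (V,)

    # Entity weights: score only the retrieved candidates against the
    # same contextual representations, reusing existing embeddings.
    cand = entity_emb[candidate_ids]                                # (C, d)
    ent_logits = hidden @ cand.T                                    # (seq_len, C)
    ent_weights = np.log1p(np.maximum(ent_logits.max(axis=0), 0.0)) # (C,)

    # Joint sparse representation: word pieces occupy [0, V), entities
    # occupy [V, V+E), so both live in one inverted-index term space.
    V = wordpiece_emb.shape[0]
    rep = {i: w for i, w in enumerate(wp_weights) if w > 0}
    rep.update({V + eid: w
                for eid, w in zip(candidate_ids, ent_weights) if w > 0})
    return rep
```

Because only the retrieved candidate entities are scored, the head never materializes logits over the full multi-million-entity vocabulary, which is what keeps the dynamic vocabulary tractable.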

Key contributions include:

  • The Dynamic Vocabulary Head: This feature allows LSR models to be extended from traditional word piece vocabularies to include millions of entities. It leverages both pre-existing entity embeddings and a retrieval component for refined candidate selection.
  • Generative Entity Retrieval: The paper introduces a few-shot approach for generating entity candidates using LLMs such as Mixtral and GPT-4. This approach yields significant accuracy gains over traditional methods such as entity linking and sparse/dense retrievers.
  • Evaluation on Benchmark Datasets: The DyVo model consistently outperforms baseline methods on three disparate datasets, including TREC Robust04 and TREC Core 2018, demonstrating its broad applicability and robustness across domains.
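A minimal sketch of the generative candidate step described above: a few-shot prompt asks an LLM to list relevant Wikipedia entities, and the response is parsed and filtered against the known entity vocabulary. The prompt wording, delimiter, and function names here are illustrative assumptions, not the paper's actual template:

```python
def build_entity_prompt(query, examples):
    """Assemble a few-shot prompt asking an LLM to list Wikipedia
    entities relevant to a query (illustrative wording)."""
    lines = ["List the Wikipedia entities relevant to each query."]
    for q, ents in examples:
        lines.append(f"Query: {q}\nEntities: {'; '.join(ents)}")
    lines.append(f"Query: {query}\nEntities:")
    return "\n\n".join(lines)

def parse_candidates(llm_output, known_entities):
    """Keep only generated names that resolve to real entities, so
    hallucinated titles never reach the DyVo head."""
    names = [n.strip() for n in llm_output.split(";") if n.strip()]
    return [n for n in names if n in known_entities]
```

Filtering against the entity vocabulary is the step that turns free-form LLM generations into valid candidate ids whose embeddings the DyVo head can score.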

Implications and Future Directions

The implications of this work are significant for both practical applications and future research directions in Information Retrieval (IR). By dynamically incorporating entities into LSR, the DyVo model ensures more accurate and contextually relevant retrieval, crucial for applications like search engines where understanding the user's intent and retrieving semantically coherent documents is critical.

Practically, DyVo can be integrated into real-world search systems requiring efficient and precise retrieval capabilities, especially where entity-rich queries are prevalent. The model's architecture is designed to be memory-efficient, mitigating the challenges posed by the integration of large-scale vocabularies.
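Because word pieces and entities share one id space, the joint representations drop into an ordinary inverted index. The toy index below (a sketch under that assumption, not a production implementation) shows how sparse dot-product scoring touches only postings for ids the query activates:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: {term_or_entity_id: weight}}.
    Builds a toy inverted index over the joint id space."""
    index = defaultdict(list)
    for doc_id, rep in docs.items():
        for tid, w in rep.items():
            index[tid].append((doc_id, w))
    return index

def search(index, query_rep, k=10):
    """Rank documents by the sparse dot product, accumulating scores
    only over postings lists for ids present in the query."""
    scores = defaultdict(float)
    for tid, qw in query_rep.items():
        for doc_id, dw in index.get(tid, []):
            scores[doc_id] += qw * dw
    return sorted(scores.items(), key=lambda x: -x[1])[:k]
```

An entity id shared by a query and a document contributes to the score exactly like a matching word piece, which is why no changes to the retrieval machinery are needed.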

Theoretically, DyVo's combination of generative entity retrieval with sparse lexical representations paves the way for further exploration of hybrid models that blend different retrieval strategies. For AI development, this work suggests a path toward more intelligent systems capable of nuanced understanding and representation of vocabularies that better reflect real-world complexity.

Overall, this paper's contributions mark a step forward for the retrieval community, providing a framework that accommodates ongoing changes and nuances in language and entities across various domains and contexts.
