Beyond Precision: A Study on Recall of Initial Retrieval with Neural Representations

Published 28 Jun 2018 in cs.IR (arXiv:1806.10869v2)

Abstract: Vocabulary mismatch is a central problem in information retrieval (IR), i.e., the relevant documents may not contain the same (symbolic) terms as the query. Recently, neural representations have shown great success in capturing semantic relatedness, leading to new possibilities to alleviate the vocabulary mismatch problem in IR. However, most existing efforts in this direction have been devoted to the re-ranking stage, that is, to leveraging neural representations to help re-rank a set of candidate documents, which are typically obtained from an initial retrieval stage based on some symbolic index and search scheme (e.g., BM25 over the inverted index). This naturally raises a concern: if the relevant documents have not been found in the initial retrieval stage due to vocabulary mismatch, there would be no chance to re-rank them to the top positions later. Therefore, in this paper, we study the problem of how to employ neural representations to improve the recall of relevant documents in the initial retrieval stage. Specifically, to meet the efficiency requirement of the initial stage, we introduce a neural index for the neural representations of documents, and propose two hybrid search schemes based on both neural and symbolic indices, namely the parallel search scheme and the sequential search scheme. Our experiments show that both hybrid index and search schemes can improve the recall of the initial retrieval stage with small overhead.

Summary

The paper introduces two hybrid search schemes (ParSearch and SeqSearch) that leverage neural representations to address vocabulary mismatch in IR.
It employs a k-nearest-neighbor graph built with TF-IDF weighted word embeddings to enable efficient semantic search in high-dimensional spaces.
Experimental evaluations on TREC datasets demonstrate that the sequential scheme significantly outperforms traditional BM25 in terms of recall.

Study on Recall of Initial Retrieval with Neural Representations

This paper investigates the use of neural representations to address the vocabulary mismatch problem in information retrieval (IR) systems. It aims to increase recall at the initial retrieval stage, which matters because a relevant document missed at that stage can never be promoted by later re-ranking.

Problem Formulation and Approach

Traditional IR systems rely on symbolic index and search schemes, such as BM25 scoring over an inverted index, which can miss relevant documents that do not share terms with the query. Neural representations promise to alleviate this vocabulary mismatch by embedding documents and queries in a semantic space. This study employs such representations at the initial retrieval stage, rather than only at re-ranking, in order to improve recall.

Two hybrid search schemes are proposed:

Parallel Search Scheme (ParSearch): This scheme searches the symbolic and neural indices simultaneously and merges the results to form the candidate set.
Sequential Search Scheme (SeqSearch): This scheme first retrieves seed documents from the symbolic index, then expands the result set with semantically similar documents found through the neural index.

Figure 1: Symbolic representation and index.
Figure 2: Neural representation and index.
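
As a rough illustration of how the two schemes differ in control flow, the Python sketch below implements both over toy inputs. The function names, the callable interfaces, the parameters `n_symbolic`, `n_neural`, `n_seeds`, and `n_expand`, and the merge policy (symbolic results first, duplicates dropped) are assumptions made for illustration; this summary does not specify them.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a search function maps (query, n) to a ranked list of doc ids.
Search = Callable[[str, int], List[str]]


def par_search(query: str, symbolic_search: Search, neural_search: Search,
               n_symbolic: int, n_neural: int) -> List[str]:
    """Query both indices independently, then merge while dropping duplicates."""
    merged: List[str] = []
    seen = set()
    for doc_id in symbolic_search(query, n_symbolic) + neural_search(query, n_neural):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


def seq_search(query: str, symbolic_search: Search, knn_graph: Dict[str, List[str]],
               n_seeds: int, n_expand: int) -> List[str]:
    """Retrieve seeds symbolically, then add each seed's nearest neighbors."""
    seeds = symbolic_search(query, n_seeds)
    candidates = list(seeds)
    seen = set(seeds)
    for seed in seeds:
        # Assumed: knn_graph[seed] is already sorted by decreasing similarity.
        for neighbor in knn_graph.get(seed, [])[:n_expand]:
            if neighbor not in seen:
                seen.add(neighbor)
                candidates.append(neighbor)
    return candidates


# Toy stand-ins for the real indices (illustrative data only).
def toy_symbolic(query: str, n: int) -> List[str]:
    return ["d1", "d2", "d3"][:n]


def toy_neural(query: str, n: int) -> List[str]:
    return ["d2", "d4"][:n]


toy_graph = {"d1": ["d5", "d2"], "d2": ["d4"]}

par_result = par_search("q", toy_symbolic, toy_neural, n_symbolic=2, n_neural=2)
seq_result = seq_search("q", toy_symbolic, toy_graph, n_seeds=2, n_expand=1)
```

The structural difference is that `par_search` queries the neural side with the query itself, whereas `seq_search` never embeds the query: it only follows graph edges from documents that the symbolic search has already returned.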

Implementation Details

The researchers address two challenges: indexing neural representations and efficiently searching them. They utilize a $k$-nearest-neighbor ($k$-NN) graph as the neural index, ensuring efficient operations in high-dimensional semantic spaces.
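
To make the idea of a $k$-NN graph concrete, here is a minimal Python sketch that links each document to its `k` most cosine-similar documents. It is a brute-force construction with quadratic cost, chosen only for readability; how the paper actually builds its graph is not described in this summary, and a collection-scale index would presumably need an approximate method. The names `cosine` and `build_knn_graph` and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_knn_graph(vectors: Dict[str, List[float]], k: int) -> Dict[str, List[str]]:
    """Brute-force k-NN graph: compare every document with every other one."""
    graph: Dict[str, List[str]] = {}
    for doc_id, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in vectors.items()
            if other_id != doc_id
        ]
        # Sort by decreasing similarity; break ties by doc id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[doc_id] = [other_id for _, other_id in scored[:k]]
    return graph


# Toy 2-dimensional "document embeddings" (illustrative data only).
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [0.1, 0.9],
}
toy_knn = build_knn_graph(toy_vectors, k=1)
```

In this toy example, `a` and `b` point at each other, as do `c` and `d`, because each pair lies close together in direction.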

For implementation:

The neural index uses TF-IDF weighted word embeddings.
The $k$-NN graph facilitates searching based on cosine similarity.
ParSearch retrieves top documents from both symbolic and neural indices and merges them.
SeqSearch employs a symbolic search to find seeds and uses neural indices to expand the subset semantically.
Figure 3: Parallel search scheme.

Figure 4: Sequential search scheme.
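
The TF-IDF weighted word embeddings mentioned in the implementation list can be illustrated with a short Python sketch. The exact weighting is not given in this summary, so the choices here are assumptions: IDF is computed as `log(N / df) + 1`, each term's weight is `tf * idf`, and the weighted sum is divided by the total weight. Since cosine similarity ignores vector length, that final normalization does not affect neighbor rankings. The function names and the toy embeddings are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(tokenized_docs: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency per term (smoothed variant; an assumption)."""
    n_docs = len(tokenized_docs)
    doc_freq: Counter = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def doc_vector(tokens: List[str], embeddings: Dict[str, List[float]],
               idf: Dict[str, float], dim: int) -> List[float]:
    """TF-IDF weighted average of the word embeddings of a document."""
    total = [0.0] * dim
    weight_sum = 0.0
    for term, tf in Counter(tokens).items():
        if term not in embeddings or term not in idf:
            continue  # skip out-of-vocabulary terms
        weight = tf * idf[term]
        for i, value in enumerate(embeddings[term]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known terms: return the zero vector
    return [value / weight_sum for value in total]


# Toy 2-dimensional word embeddings and a two-document collection (illustrative only).
toy_embeddings = {"car": [1.0, 0.0], "auto": [0.9, 0.1], "bank": [0.0, 1.0]}
toy_docs = [["car", "car", "bank"], ["auto"]]

toy_idf = idf_table(toy_docs)
toy_doc_vec = doc_vector(toy_docs[0], toy_embeddings, toy_idf, dim=2)
```

In the toy collection, every term occurs in exactly one document, so all IDF values are equal and the first document's vector is simply the term-frequency weighted mean of `car` (twice) and `bank` (once).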

Experimental Evaluation

The proposed schemes are tested on two TREC collections, Robust04 and WT2G, focusing on recall@1000 and time efficiency. Key findings include:

SeqSearch achieves higher recall than ParSearch, which is attributed to its expansion being anchored on symbolically retrieved seed documents and therefore admitting less noise.
Both search schemes enhance recall over traditional BM25, with SeqSearch's selective expansion proving effective.
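
For reference, recall@k is the fraction of a query's relevant documents that appear among the top k retrieved candidates; the evaluation above uses k = 1000. A minimal Python sketch, with a hypothetical function name and toy data:

```python
from typing import Iterable, List


def recall_at_k(ranked: List[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant documents found in the first k ranked results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # convention chosen here for queries without relevant documents
    retrieved = set(ranked[:k])
    return len(retrieved & relevant_set) / len(relevant_set)


# Toy ranking and relevance judgments (illustrative data only).
toy_ranked = ["d3", "d1", "d7", "d2"]
toy_relevant = {"d1", "d2", "d9"}

recall_at_3 = recall_at_k(toy_ranked, toy_relevant, k=3)
recall_at_4 = recall_at_k(toy_ranked, toy_relevant, k=4)
```

Here `d9` is never retrieved, so recall stays below 1 at any cutoff. This is the situation the paper targets: a document absent from the candidate set cannot be recovered by re-ranking.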

Implications and Future Work

The study provides a foundation for integrating neural representations efficiently into the initial retrieval phase of search engines. By achieving higher recall with small overhead, these methods hold promise for practical IR systems.

Future work may explore non-metric similarity functions within $k$-NN graphs and further improve the precision of semantic matching. Additionally, other sequential strategies could be examined to optimize efficiency further.

Conclusion

This research illustrates the potential of neural representations in the initial retrieval stage: by addressing vocabulary mismatch, the two hybrid search schemes improve recall at a small efficiency cost. Both schemes could inspire further exploration and refinement of hybrid symbolic and neural retrieval.
