
A Neural Corpus Indexer for Document Retrieval (2206.02743v3)

Published 6 Jun 2022 in cs.IR

Abstract: Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, where the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying the training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose the Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a designated query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on two commonly used academic benchmarks, achieving +21.4% and +16.8% relative improvements in Recall@1 on the NQ320k dataset and R-Precision on the TriviaQA dataset, respectively, compared to the best baseline method.

A Neural Corpus Indexer for Document Retrieval

The paper introduces the Neural Corpus Indexer (NCI), a technique for improving the effectiveness of document retrieval systems. Because traditional index-retrieve paradigms are difficult to optimize directly for the retrieval target, NCI instead uses an end-to-end deep learning model that integrates the training and indexing stages. The authors focus on improving recall, a critical metric for web search engines, by linking query generation to the semantic structure encoded in document identifiers.

NCI employs a sequence-to-sequence architecture that directly generates the identifiers of documents relevant to a query, a notable departure from both inverted-index and dense-retrieval methodologies. The paper contrasts term-based retrieval techniques, which are prone to semantic mismatch, with dense-retrieval approaches, whose embedding vectors have limited capacity to capture fine-grained relevance. By addressing both drawbacks, NCI leverages the full capacity of deep neural networks to model complex query-document interactions.
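
Since only a finite set of identifiers is valid, generation in this family of models is typically constrained so the decoder can emit only token sequences that correspond to real documents. The following is a minimal illustrative sketch, not the authors' implementation: it builds a prefix trie over hypothetical docid token sequences and exposes the allowed-next-token lookup that a constrained beam search would consult at each decoding step.

```python
# Minimal sketch: a prefix trie over valid document-identifier token
# sequences, used to constrain seq2seq decoding so that every generated
# identifier corresponds to an actual document. Docids are hypothetical.

class DocidTrie:
    def __init__(self):
        self.children = {}      # token -> DocidTrie
        self.is_end = False     # True if a complete docid ends here

    def insert(self, tokens):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, DocidTrie())
        node.is_end = True

    def allowed_next_tokens(self, prefix):
        """Tokens the decoder may emit after having generated `prefix`."""
        node = self
        for tok in prefix:
            if tok not in node.children:
                return []       # prefix does not lead to any valid docid
            node = node.children[tok]
        return list(node.children)

# Each position of a semantic docid is a cluster choice at one level of
# the identifier hierarchy (toy values).
trie = DocidTrie()
for docid in [(3, 1, 4), (3, 1, 5), (3, 2, 0), (7, 0, 2)]:
    trie.insert(docid)

print(trie.allowed_next_tokens(()))        # [3, 7]
print(trie.allowed_next_tokens((3, 1)))    # [4, 5]
```

In a library such as HuggingFace Transformers, a lookup of this kind can be plugged into beam search through the `prefix_allowed_tokens_fn` argument of `generate`.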

Empirical evaluations on two standard datasets, NQ320k and TriviaQA, demonstrate that NCI outperforms the best existing baselines by relative margins of 21.4% in Recall@1 and 16.8% in R-Precision, respectively. These improvements are attributed to several tailored techniques: a prefix-aware weight-adaptive decoder architecture, training on augmented query-document pairs, semantic document identifiers created via hierarchical k-means, and consistency-based regularization.
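
To make the semantic identifiers concrete: documents are embedded (the paper uses BERT representations) and recursively partitioned with hierarchical k-means, and the root-to-leaf sequence of cluster assignments becomes each document's identifier, so semantically similar documents share identifier prefixes. The sketch below illustrates the idea under assumed parameters (branching factor k=2 and random vectors standing in for real embeddings); it is not the authors' code.

```python
# Minimal sketch of semantic document identifiers via hierarchical k-means.
# Random vectors stand in for real document embeddings; the paper uses
# BERT-based embeddings and a larger branching factor.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_docids(embeddings, indices, k=2, leaf_size=2, prefix=()):
    """Recursively cluster documents; each document's docid is its
    root-to-leaf path of cluster labels, so similar docs share prefixes."""
    docids = {}
    if len(indices) <= leaf_size:
        # Distinguish documents within a leaf by a final position token.
        for pos, idx in enumerate(indices):
            docids[idx] = prefix + (pos,)
        return docids
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
        embeddings[indices])
    for c in range(k):
        members = indices[labels == c]
        docids.update(hierarchical_docids(
            embeddings, members, k, leaf_size, prefix + (c,)))
    return docids

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))            # 10 documents, 8-dim embeddings
ids = hierarchical_docids(emb, np.arange(10))
for doc, docid in sorted(ids.items()):
    print(doc, docid)
```

Because identifiers that share a prefix are semantically close, the prefix-aware weight-adaptive decoder can apply different classification weights at each position of the identifier, reflecting that the same token carries different meanings at different levels of the hierarchy.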

The implications of this research are manifold. From a theoretical perspective, NCI points to ways of integrating task-specific enhancements into end-to-end learning frameworks, fostering more effective semantic representations. Practically, such a model could streamline search-engine operations by encapsulating retrieval and ranking within a single differentiable framework, promising improvements in serving speed and ease of maintenance.

Future work could build on this foundation by exploring larger model capacities, real-time retrieval applications, and fast update mechanisms for the indexer. Moreover, unifying the retrieval and ranking components into a single network could be a substantial step toward next-generation search engines.

Authors (16)
  1. Yujing Wang (53 papers)
  2. Yingyan Hou (9 papers)
  3. Haonan Wang (84 papers)
  4. Ziming Miao (8 papers)
  5. Shibin Wu (6 papers)
  6. Hao Sun (383 papers)
  7. Qi Chen (194 papers)
  8. Yuqing Xia (12 papers)
  9. Chengmin Chi (1 paper)
  10. Guoshuai Zhao (12 papers)
  11. Zheng Liu (312 papers)
  12. Xing Xie (220 papers)
  13. Hao Allen Sun (1 paper)
  14. Weiwei Deng (29 papers)
  15. Qi Zhang (785 papers)
  16. Mao Yang (62 papers)
Citations (120)