
Replication and Exploration of Generative Retrieval over Dynamic Corpora (2504.17519v1)

Published 24 Apr 2025 in cs.IR

Abstract: Generative retrieval (GR) has emerged as a promising paradigm in information retrieval (IR). However, most existing GR models are developed and evaluated using a static document collection, and their performance in dynamic corpora where document collections evolve continuously is rarely studied. In this paper, we first reproduce and systematically evaluate various representative GR approaches over dynamic corpora. Through extensive experiments, we reveal that existing GR models with text-based docids show superior generalization to unseen documents. We observe that the more fine-grained the docid design in the GR model, the better its performance over dynamic corpora, surpassing BM25 and even being comparable to dense retrieval methods. While GR models with numeric-based docids show high efficiency, their performance drops significantly over dynamic corpora. Furthermore, our experiments find that the underperformance of numeric-based docids is partly due to their excessive tendency toward the initial document set, which likely results from overfitting on the training set. We then conduct an in-depth analysis of the best-performing GR methods. We identify three critical advantages of text-based docids in dynamic corpora: (i) Semantic alignment with LLMs' pretrained knowledge, (ii) Fine-grained docid design, and (iii) High lexical diversity. Building on these insights, we finally propose a novel multi-docid design that leverages both the efficiency of numeric-based docids and the effectiveness of text-based docids, achieving improved performance in dynamic corpora without requiring additional retraining. Our work offers empirical evidence for advancing GR methods over dynamic corpora and paves the way for developing more generalized yet efficient GR models in real-world search engines.

Summary

Replication and Exploration of Generative Retrieval over Dynamic Corpora

In the field of Information Retrieval (IR), the paper "Replication and Exploration of Generative Retrieval over Dynamic Corpora" examines how Generative Retrieval (GR) models behave when the document collection is not fixed. Traditional IR systems follow a pipelined "index-retrieve-then-rank" architecture with separately optimized modules and are typically developed against static document collections. The GR paradigm replaces this pipeline with a single end-to-end trained model, which raises the question of how well such models adapt when document collections evolve continuously.

Generative Retrieval, at its core, transforms document retrieval by encoding corpus information into the parameters of a generative language model. Given a query, the model autoregressively generates document identifiers (docids), diverging from traditional sparse or dense retrieval approaches. The paper presents an empirical analysis spanning several GR architectures, contrasting their performance and adaptability over dynamic corpora where the document collection grows over time.
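The autoregressive docid generation described above is typically implemented with constrained decoding, so the model can only emit token sequences that correspond to valid docids. The sketch below illustrates the idea with a prefix trie and a toy scoring function standing in for an LLM's next-token logits; the trie structure and `score` callable are illustrative assumptions, not the paper's implementation.

```python
# Constrained greedy decoding over a prefix trie of valid docids.
# The trie guarantees every generated sequence is an existing docid.

def build_trie(docids):
    """Map each docid (a tuple of tokens) into a nested-dict prefix trie."""
    trie = {}
    for docid in docids:
        node = trie
        for tok in docid:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}  # mark the end of a complete docid
    return trie

def constrained_decode(trie, score):
    """Greedily pick the highest-scoring allowed token at each step.
    `score(prefix, token)` stands in for an LLM's next-token logit."""
    prefix, node = [], trie
    while True:
        allowed = list(node)
        best = max(allowed, key=lambda t: score(tuple(prefix), t))
        if best == "<eos>":
            return tuple(prefix)
        prefix.append(best)
        node = node[best]

# Toy corpus of text-based docids (e.g., title n-grams).
docids = [("neural", "retrieval"), ("neural", "ranking"), ("sparse", "index")]
trie = build_trie(docids)

# A toy scorer that prefers tokens overlapping the "query".
query = {"neural", "ranking"}
score = lambda prefix, tok: 1.0 if tok in query else 0.0

print(constrained_decode(trie, score))  # → ('neural', 'ranking')
```

In a real GR system the same trie (or an FM-index, as in SEAL) constrains beam search over the LLM's vocabulary rather than greedy selection over a toy scorer.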

Strong Numerical Insights and Claims

The paper offers a systematic reproduction of representative GR methods in scenarios that mimic real-world document evolution. Through extensive experiments, it substantiates that GR models employing text-based docids, such as SEAL, generalize to unseen documents far better than their numeric-based counterparts; text-based docids even achieve performance comparable to dense retrieval methods such as DPR-HN in dynamic settings. This is noteworthy because numeric-based docids, though efficient in static environments, degrade sharply as the corpus evolves, a drop the paper attributes in part to overfitting on, and bias toward, the initial training documents.

These findings are supported by the Initial Document Bias Index (IDBI), a measure introduced to quantify retrieval bias toward the initial document set in dynamic-corpus scenarios. GR models using numeric-based docids exhibit pronounced bias toward initial documents, reflecting a semantic gap between newly added documents and the previously learned docid space. This analysis pinpoints key weaknesses of numeric-based methods and motivates improved docid designs for GR.
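The summary does not reproduce the paper's exact IDBI formula, but the underlying idea, quantifying how often retrieval results come from the initial document set, can be sketched as the fraction of top-k retrieved documents belonging to the initial corpus, averaged over queries. This is a hypothetical formulation for illustration only; the paper's definition may differ.

```python
# Hypothetical sketch of an Initial Document Bias Index (IDBI):
# the share of top-k retrieved documents that belong to the initial
# (training-time) corpus, averaged over queries. The paper's exact
# formula may differ; this only illustrates quantifying bias toward
# initially indexed documents.

def idbi(retrieved_per_query, initial_docs, k=10):
    """retrieved_per_query: list of ranked docid lists, one per query.
    initial_docs: set of docids present before the corpus grew."""
    fractions = []
    for ranked in retrieved_per_query:
        top_k = ranked[:k]
        hits = sum(1 for d in top_k if d in initial_docs)
        fractions.append(hits / max(len(top_k), 1))
    return sum(fractions) / len(fractions)

initial = {"d1", "d2", "d3"}           # docids indexed at training time
runs = [["d1", "d9", "d2"],            # query 1: 2 of 3 from initial set
        ["d8", "d7", "d3"]]            # query 2: 1 of 3 from initial set
print(idbi(runs, initial, k=3))  # → 0.5
```

A value far above the initial set's share of the full corpus would indicate the bias toward initial documents that the paper observes for numeric-based docids.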

Implications for Future Developments

The insights provided by this paper have profound implications for future research and development in IR systems, particularly in real-world applications demanding robust dynamic adaptability. The integration of text-based docid systems reveals pathways for designing GR models that can efficiently adapt without retraining, a critical requirement for scalable deployment in environments like search engines where document collections evolve continuously.

Moreover, the paper's analysis of docid granularity, lexical diversity, and semantic alignment with the LLM's pretrained knowledge highlights areas ripe for further work. Hybrid or adaptive GR models that combine numeric-based and text-based principles could balance retrieval effectiveness with computational efficiency, a direction the paper begins to explore through its Multi-Docid Generative Retrieval (MDGR) framework.
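The summary does not specify how MDGR combines the two docid types. One training-free way to merge rankings produced by a numeric-based scheme and a text-based scheme is reciprocal rank fusion (RRF); the sketch below uses RRF purely as an illustrative assumption, not as the paper's actual mechanism.

```python
# Hypothetical fusion of ranked lists from two docid schemes
# (e.g., numeric-based and text-based) via reciprocal rank fusion.
# MDGR may combine docids differently; this sketches one way to
# merge two rankings without any additional retraining.

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked docid lists. Returns the fused ranking,
    scoring each doc by the sum of 1 / (k + rank) across lists."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

numeric_run = ["d3", "d1", "d5"]   # run from numeric-based docids
text_run = ["d1", "d4", "d3"]      # run from text-based docids
print(rrf_fuse([numeric_run, text_run]))  # → ['d1', 'd3', 'd4', 'd5']
```

Documents ranked highly by both schemes rise to the top, which matches the intuition of pairing numeric-based efficiency with text-based generalization.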

Conclusion

Overall, the paper's examination of GR models over dynamic corpora offers valuable guidance for building IR systems that cope with evolving document collections. By systematically exploring docid types and designs and validating the findings with comprehensive empirical evidence, the work deepens our understanding of what makes GR models generalize. Its results point toward hybrid systems that balance the efficiency of numeric-based docids with the adaptability of text-based docids, a promising avenue for scalable generative retrieval in real-world search engines.
