Scaling Generative Search to Millions of Passages

Develop scalable generative search methods, particularly numeric ID-based approaches such as Differentiable Search Index (DSI), that operate effectively on corpora containing millions of passages and achieve retrieval performance and efficiency competitive with dual-encoder baselines.

Background

In its discussion of numeric IDs as document identifiers for generative search, the survey highlights significant challenges in generalization, corpus updates, and scaling to large corpora. Numeric ID-based approaches require the LLM to memorize the association between each document and its ID, which becomes increasingly difficult as the corpus grows.
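
The memorization burden can be illustrated with a minimal sketch of how training pairs for a numeric ID-based model might be assembled. This is an illustration under stated assumptions, not the setup of DSI or of the cited papers: the function name `build_training_pairs`, the `index:` and `query:` prefixes, and the toy passages are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_training_pairs(
    corpus: Dict[str, str],
    queries: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (input_text, target_id) pairs for a sequence-to-sequence model.

    The "index:" and "query:" prefixes are hypothetical task markers chosen
    for this sketch only.
    """
    # Indexing examples: the model reads passage text and must emit its numeric ID.
    pairs = [(f"index: {text}", doc_id) for doc_id, text in corpus.items()]
    # Retrieval examples: the model reads a query and must emit the relevant ID.
    pairs += [(f"query: {query}", doc_id) for query, doc_id in queries]
    return pairs


# Toy corpus with arbitrary numeric IDs (illustrative data only).
corpus = {
    "0": "Dual encoders embed queries and passages into a shared space.",
    "1": "Generative search decodes a document identifier token by token.",
    "2": "Corpus updates require the model to learn new identifiers.",
}
queries = [
    ("how does generative search return a document", "1"),
    ("what happens when new passages are added", "2"),
]

pairs = build_training_pairs(corpus, queries)
indexing_pairs = [p for p in pairs if p[0].startswith("index: ")]
```

In this sketch the number of indexing pairs equals the number of passages, and a numeric ID carries no information about the passage content, so every association has to be stored in the model parameters. At millions of passages, that is the part of the workload that grows with the corpus.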

The cited scaling paper reports that while generative search can be competitive on small corpora, extending to millions of passages remains unresolved, underscoring a core limitation of current identifier and training strategies in large-scale retrieval settings.

References

It is found that while generative search is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.

A Survey of Generative Search and Recommendation in the Era of Large Language Models (2404.16924 - Li et al., 25 Apr 2024) in Section 4.2, Document Identifiers (Numeric ID)