
Generative Retrieval: Docid Generation

Updated 29 September 2025
  • Generative Retrieval (GR) is an information retrieval paradigm that uses autoregressive language models to encode corpus knowledge and directly generate document identifiers (docids).
  • It offers efficient indexing, reduced storage requirements, and up to 10× lower computational cost per query by internalizing document representations.
  • Empirical evaluations show that GR consistently outperforms dual encoder systems on dynamic corpora, maintaining robust performance with minimal catastrophic forgetting.

Generative Retrieval (GR) is an information retrieval paradigm in which an autoregressive LLM encodes corpus knowledge into its model parameters and retrieves relevant documents by directly generating their identifiers (docids), rather than computing term-based or embedding-based similarity scores as in traditional sparse and dense retrieval systems. This approach decouples both indexing and retrieval from external vector stores, instead leveraging the generation capacity and internal memory of large sequence models, and operates via end-to-end training to predict the most relevant identifier sequences for input queries. Recent research has focused on the unique representational, efficiency, and adaptability properties of GR, especially in dynamic corpus scenarios, and explored its relationships and trade-offs with classic dual encoder and dense retrieval models.

1. Paradigm and Theoretical Foundations

Generative Retrieval is formulated as a sequence-to-sequence autoregressive generation task, in which the LLM learns to output valid docids (numerical, symbolic, or text-based) for a given query $q$. During training, the model minimizes a negative log-likelihood objective over all valid docid sequences, formalized as

$$\mathcal{L}_{GR}(\Theta) = \mathbb{E}_q \left[ -\log P_\Theta(d^+ \mid q) \right] = \mathbb{E}_q \left[ -\sum_{t=1}^{L} \log p_\Theta\!\left(y^+_t \mid y^+_{1..t-1}, q\right) \right],$$

with normalization over the full docid space. Unlike dense retrieval (DR), which minimizes a locally normalized cross-entropy over a small candidate set, GR's objective is globally normalized and not subject to calibration drift as the corpus and negative sample sizes grow (Zhang et al., 26 Sep 2025). This global normalization facilitates direct modeling of the true conditional distribution of relevant docids given the query.
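
As a concrete sketch, the per-batch loss above can be computed under teacher forcing roughly as follows; the seq2seq model interface, tensor shapes, and padding convention are assumptions made for illustration, not details from the cited work:

```python
import torch
import torch.nn.functional as F

def gr_loss(logits: torch.Tensor, docid_tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Negative log-likelihood of the relevant docid under teacher forcing.

    logits:       (batch, L, vocab) decoder outputs for each query q
                  (produced by some seq2seq model, assumed here).
    docid_tokens: (batch, L) gold docid token ids y^+_{1..L}, padded with pad_id.
    """
    # -sum_t log p_Theta(y^+_t | y^+_{1..t-1}, q); padding positions are ignored.
    token_nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        docid_tokens.reshape(-1),
        ignore_index=pad_id,
        reduction="sum",
    )
    # Monte Carlo estimate of E_q[-log P_Theta(d^+ | q)] over the batch.
    return token_nll / logits.size(0)
```

Note that `cross_entropy` normalizes over the entire output vocabulary at each decoding step, the per-token counterpart of the global normalization over the docid space discussed above.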

Representationally, DR models are constrained by the rank of the matrix product $S = QD^T$ with fixed-size embeddings (rank at most $r$), while GR can, in principle, encode an arbitrary query-document relevance matrix by leveraging a sufficiently expressive decoder and parameterization. Theoretical analysis confirms that GR has unbounded representational capacity with respect to parameter size, whereas DR capacity plateaus (Zhang et al., 26 Sep 2025).
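
A toy numerical illustration of the rank constraint, with arbitrary sizes chosen purely for demonstration:

```python
import numpy as np

# Toy illustration: with r-dimensional embeddings, the DR score matrix
# S = Q D^T has rank at most r, so it cannot realize an arbitrary
# |Q| x |D| relevance pattern once min(|Q|, |D|) exceeds r.
rng = np.random.default_rng(0)
r, n_queries, n_docs = 8, 100, 1000      # arbitrary illustrative sizes
Q = rng.standard_normal((n_queries, r))  # query embeddings
D = rng.standard_normal((n_docs, r))     # document embeddings
S = Q @ D.T                              # dense-retrieval score matrix
print(np.linalg.matrix_rank(S))          # 8, i.e. bounded by r
```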

2. Practical Adaptability and Robustness

GR models demonstrate strong practical adaptability and robustness in environments where the document collection evolves and temporal information is present. For dynamic corpora (as in the StreamingQA benchmark), state-of-the-art GR models such as SEAL and MINDER show a +13% to +18% increase in hit@5 for new documents over dual encoder (DE) baselines, under both parameter-updating and simple index-extension strategies (Kim et al., 2023). GR also exhibits markedly less catastrophic forgetting: performance degradation on earlier corpus segments remains around 1.23%, compared to 3.2% for DE.

A key robustness advantage lies in temporal generalization. While DE models are susceptible to spurious cues such as timestamps, evidenced by a 2–3× performance drop when timestamps are hidden, GR performance remains stable, indicating less reliance on lexical overlap and greater semantic resilience (Kim et al., 2023).

3. Efficiency: Indexing, Storage, and Computational Complexity

GR provides substantial advantages in indexing and storage efficiency. Indexing time for GR is greatly reduced compared to DE: 2.7 versus 18.9 hours on the base corpus, and 3.1 versus 20.4 hours on the updated corpus, at comparable dataset scales (Kim et al., 2023). Storage footprints are also much smaller, since GR does not maintain explicit document vector stores; instead, all knowledge is encoded internally (e.g., 29 GB for GR versus 127 GB for DE systems). Only a compact FM-index or trie-based docid structure may be needed.

During inference, GR operates at constant time per query, $\mathcal{O}(1)$, independent of the corpus size, as opposed to the $\mathcal{O}(C)$ cost of DE inner-product matching over a corpus of $C$ documents. The computational complexity for DE is given by

$$\mathrm{DE}_{\text{flops}} = C \times I_P, \quad I_P = d_{\mathrm{model}} \times (d_{\mathrm{model}} - 1).$$

For GR, the complexity is

$$\mathrm{GR}_{\text{flops}} = \mathrm{FW}_{\text{flops}} + L \times \mathrm{Beam}_{\text{flops}},$$

where

$$\mathrm{FW}_{\text{flops}} = 2N + 2\, n_{\text{layer}}\, n_{\text{ctx}}\, d_{\text{attn}}$$

and

$$\mathrm{Beam}_{\text{flops}} = \mathrm{FW}_{\text{flops}} \times I_P \times |V| \log |V| \times B,$$

for beam size $B$, vocabulary size $|V|$, docid length $L$, and layer/context/attention dimensions as in transformer architectures. In practice, GR achieves up to 10× lower computational cost per query than DE (Kim et al., 2023).
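
To make the scaling contrast concrete, here is a purely illustrative calculation of the DE-side cost as the corpus grows, using an assumed embedding width; note that none of the GR terms above contain a corpus-size factor:

```python
# Illustrative only: dual-encoder per-query matching cost grows linearly
# with corpus size C, whereas the GR terms above do not depend on C.
d_model = 768                              # assumed embedding width
I_P = d_model * (d_model - 1)              # flops per query-document inner product
for C in (10**5, 10**6, 10**7):            # corpus sizes (number of documents)
    print(f"C={C:>10,d}  DE_flops={C * I_P:.3e}")
```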

Although latency can be marginally higher for GR (due to token-wise generation not being as easily batched as DE with ANN libraries such as FAISS), these efficiency gains make GR competitive for real-world, large-scale, dynamically changing retrieval systems.

4. Empirical Performance in Static and Dynamic Settings

Comprehensive benchmarking on StreamingQA reveals that GR models (MINDER, SEAL) achieve around 35–38% hit@5 in static settings, surpassing the 16–20% observed for DE baselines such as Spider and Contriever (Kim et al., 2023). In dynamic corpora, GR maintains or even slightly improves hit@5 as new documents are incorporated, while DE models not only lose performance on novel content but also exhibit pronounced overfitting to temporal and lexical signals.

Crucially, GR does not suffer from the performance gap between queries targeting the base corpus and those targeting the expanded corpus, a persistent issue for DE in dynamic scenarios. This consistent accuracy across corpus updates indicates that the internalized parametric memory and robust generalization properties of GR provide a more stable retrieval solution in evolving environments.

5. Design Trade-offs and Deployment Considerations

While GR has clear strengths in computational efficiency, storage, indexing latency, and retrieval robustness, various deployment trade-offs remain. DE approaches can achieve lower real-time latency on identical hardware using highly optimized ANN libraries with batch parallelism, particularly when leveraging GPU-based acceleration (Kim et al., 2023). However, GR's independence from corpus size in computation scaling and its lower memory footprint make it attractive as the collection grows.

In index update workflows, GR needs only to retrain or continue training on new data (if employing continual learning or incremental product quantization schemes), rather than recomputing entire vector indexes. Nevertheless, practical systems must manage issues such as error propagation in autoregressive generation, constrained decoding to ensure valid docid generation, and balancing retrievability with compactness and coverage when designing docid spaces.
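
As a minimal sketch of constrained decoding over a valid docid space, the following prefix trie exposes the set of admissible next tokens for any partial identifier; the token ids are invented for illustration, and production systems often use an FM-index instead, as noted above:

```python
from collections import defaultdict

class DocidTrie:
    """Prefix trie over docid token sequences, used to constrain decoding
    so that only identifiers present in the corpus can be generated."""

    def __init__(self):
        self.children = defaultdict(DocidTrie)
        self.is_end = False

    def add(self, tokens):
        """Register one docid (a sequence of token ids)."""
        node = self
        for t in tokens:
            node = node.children[t]
        node.is_end = True

    def allowed_next(self, prefix):
        """Token ids that keep the partial docid valid; empty set if the
        prefix cannot be extended to any registered docid."""
        node = self
        for t in prefix:
            if t not in node.children:
                return set()
            node = node.children[t]
        return set(node.children)

# Hypothetical usage with made-up token ids for two docids.
trie = DocidTrie()
trie.add([12, 7, 3])
trie.add([12, 9])
print(sorted(trie.allowed_next([12])))  # [7, 9]
```

During beam search, the decoder's logits at each step would be masked so that only tokens returned by `allowed_next` can be generated, guaranteeing that every completed sequence corresponds to an existing docid (end-of-sequence handling omitted here for brevity).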

The model size must also be matched to corpus scale: underparameterized GR can underperform DE due to insufficient memorization, while overparameterized models may see diminishing returns in efficiency.

6. Future Directions

Emerging research in GR focuses on hybridizing memory-based and matching-based retrieval, multi-stage architectures (using coarse GR for fast narrowing followed by fine-grained dense reranking), dynamic docid generation strategies, and adaptive continual learning for rapidly changing corpora. Practical scaling to millions or billions of documents, robust handling of incremental corpus updates, and maintaining retrieval quality in resource-constrained deployments are central open challenges.
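
A plausible shape for such a coarse-to-fine pipeline is sketched below; both model interfaces (`generate_docids`, `score`) are hypothetical placeholders rather than APIs from any specific system:

```python
def two_stage_retrieve(query, gr_model, dense_reranker, k_coarse=100, k_final=10):
    """Hypothetical coarse-to-fine pipeline: GR narrows the candidate set,
    a dense model reranks it. Both interfaces are assumptions."""
    # Stage 1: constrained docid generation, cost independent of corpus size.
    candidates = gr_model.generate_docids(query, beam_size=k_coarse)
    # Stage 2: dense scoring restricted to the shortlisted documents.
    scored = sorted(
        ((d, dense_reranker.score(query, d)) for d in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [d for d, _ in scored[:k_final]]
```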

The outlined findings consistently indicate that generative retrieval is highly adaptable, robust to knowledge evolution, and computationally efficient, achieving lower indexing time, reduced storage, and constant inference complexity, while delivering strong empirical retrieval performance on dynamic QA benchmarks. GR represents a compelling alternative to dual encoder paradigms for large-scale, production-grade IR systems where continual document flow and efficiency are paramount (Kim et al., 2023).
