Generative Retrieval in IR
- Generative Retrieval is an IR paradigm that uses sequence-to-sequence models to directly generate document identifiers and evidential units without relying on external indexes.
- It employs a transformer-based encoder–decoder architecture to model dependencies among generated evidence items, supporting non-redundant, context-aware retrieval.
- Empirical results show that generative retrieval achieves significant memory and latency improvements, supporting flexible evidence aggregation for fact verification.
Generative retrieval is a paradigm in information retrieval (IR) in which a sequence-to-sequence (seq2seq) neural model is trained to directly generate document identifiers or evidential units relevant to an input query, superseding classic index–retrieve–rank architectures. Rather than relying on external inverted indexes or dense vector stores, generative retrieval models internalize corpus knowledge within their parameters and perform retrieval through auto-regressive sequence generation. This approach enables dynamic, dependency-aware evidence selection and exhibits distinct advantages and associated challenges in accuracy, efficiency, and scalability.
1. Principle of Generative Retrieval
Generative retrieval reframes the retrieval objective as conditional sequence generation. Given an input query—such as a claim for fact verification, a search question, or an open-domain query—the model is trained to generate a sequence of tokens that uniquely identifies or describes relevant documents (e.g., Wikipedia titles) and evidential units (such as sentence numbers). This process is typically realized using a transformer-based encoder–decoder architecture.
Formally, for a claim $c$, the model encodes the input as $\mathbf{h} = \mathrm{Enc}(c)$, then decodes document titles and evidence sentence identifiers in a sequential fashion:
- For document titles: $p(t_i \mid c, t_{<i})$, for $i = 1, \dots, n$
- For evidence sentences: $p(s_j \mid c, t_{1:n}, s_{<j})$, for $j = 1, \dots, m$
Each decoding stage models dependencies on previously generated outputs, enabling the system to coordinate document and sentence selection dynamically. The full objective is joint: $\mathcal{L} = \mathcal{L}_{\mathrm{title}} + \mathcal{L}_{\mathrm{evid}}$, where each term is a negative log-likelihood over the respective generative targets.
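The joint objective above can be sketched in a few lines. The snippet below is a minimal illustration, not the GERE training code: it assumes the model has already produced, for each decoding step, the probability it assigned to the gold token, and sums the two negative log-likelihood terms.

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one generated sequence, given the model's
    probability for each gold token conditioned on its prefix."""
    return -sum(math.log(p) for p in step_probs)

def joint_loss(title_step_probs, evidence_step_probs):
    """Joint training objective: the sum of the title-decoding and the
    evidence-decoding negative log-likelihoods."""
    return sequence_nll(title_step_probs) + sequence_nll(evidence_step_probs)
```

A perfectly confident model (probability 1.0 at every step) incurs zero loss; lower per-step probabilities increase either term independently, so both decoders are trained jointly but contribute additively.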
2. Departures from Classical Retrieval Paradigms
Traditional IR pipelines, especially in fact verification and QA, follow a staged approach: document retrieval → sentence retrieval → claim/passage verification, each stage relying on indexes or dual-encoder/dense matching models with separate scoring. This architecture imposes several limitations:
- Large document and sentence indexes must be constructed and maintained, imposing substantial memory and computational overhead—e.g., DPR index size of 70.9GB versus GERE's 2.1GB (Chen et al., 2022).
- Independent ranking fails to capture inter-dependencies among evidential units.
- Fixed selection (top-$k$) bounds evidence sets, reducing flexibility.
Generative retrieval eliminates explicit indexing, replaces multi-stage scoring with sequence generation, and enables variable-length evidence output as required per query. This provides memory and time efficiency while supporting more nuanced evidence aggregation.
| Classical Pipeline | Generative Retrieval (GERE) |
|---|---|
| Large static indexes | Indexless; knowledge in model parameters |
| Staged, independent selection | Joint, dependency-aware sequence generation |
| Fixed result sets | Dynamic number of evidences per query |
3. Sequential Generation and Evidence Coordination
A central feature of generative retrieval as instantiated by GERE (Chen et al., 2022) is the modeling of dependencies via sequential decoding. The title decoder outputs a sequence of document titles, with each decision dependent on both the claim and titles already generated. This allows for:
- Avoiding redundancy (decoding can track previously chosen titles).
- Conditioning on coverage (subsequent generations can offset limitations of prior decisions).
- Modeling complex interactions (e.g., evidence for and against a claim in multi-evidence tasks).
The evidence decoder subsequently generates fine-grained sentence identifiers from the candidate document pool, also sequentially, tracking previously selected evidence units to ensure diverse, non-redundant support.
This contrasts with earlier bi-encoder ranking schemes, which score each candidate independently and do not propagate contextual dependencies among retrieved evidences.
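The dependency-aware decoding described above can be illustrated with a greedy sketch. Everything here is schematic: `next_score` and `toy_score` are hypothetical stand-ins for the decoder's next-title scores (a real model decodes token by token), and a non-positive best score plays the role of emitting an end-of-sequence token.

```python
def decode_titles(claim, next_score, candidates, max_titles=5):
    """Greedy sketch of sequential title decoding: each step scores the
    remaining candidates conditioned on the claim AND the titles generated
    so far, so redundant picks can be suppressed."""
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < max_titles:
        scored = [(next_score(claim, tuple(chosen), t), t) for t in remaining]
        best_score, best = max(scored)
        if best_score <= 0:  # the decoder would stop (EOS) here
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

def toy_score(claim, chosen, title):
    """Toy scorer: base relevance minus a penalty for word overlap with
    already-chosen titles, mimicking redundancy avoidance."""
    base = {"Albert Einstein": 3.0, "Einstein family": 2.0, "Physics": 1.0}[title]
    overlap = sum(1 for c in chosen if set(c.split()) & set(title.split()))
    return base - 2.5 * overlap
```

With this scorer, "Einstein family" is individually more relevant than "Physics", yet once "Albert Einstein" has been decoded its score drops below the stopping threshold and "Physics" is selected instead; an independent bi-encoder ranker, scoring each candidate in isolation, could not make that trade-off.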
4. Memory and Computational Efficiency
Generative retrieval offers significant efficiency gains. With model size scaling linearly with vocabulary (rather than number of documents) and direct generation in lieu of full-corpus ranking, both memory usage and inference latency are sharply reduced.
Experimental results on the FEVER dataset (Chen et al., 2022) show:
- GERE's memory footprint: ~2.1GB, compared to DPR's 70.9GB.
- Inference time: GERE 5.35ms per claim, NSMN 28.51ms per claim.
- With an average of 1.91 documents and 2.42 sentences retrieved per claim, GERE achieves higher precision despite a potentially lower recall than fixed-$k$ baselines.
This suggests the paradigm is especially attractive for deployment in resource-constrained or latency-sensitive environments.
5. Empirical Results and Fact Verification
Empirical evaluation with GERE (Chen et al., 2022), specifically on the FEVER dataset, demonstrates:
- Document F1: 81.10, outperforming dense retrieval baselines such as RAG.
- Superior precision in document and sentence retrieval, due to dynamic evidence set sizes and reduction of irrelevant evidence.
- Downstream improvements: When downstream claim verification models (e.g., BERT Concat, GAT, KGAT) are fed GERE-generated evidences, both label accuracy and FEVER score increase, indicating that the quality and consistency of generatively retrieved evidence supports stronger reasoning.
These quantitative results underpin the argument that generative retrieval enables effective, context-aware evidence collection, improving end-to-end reasoning in fact verification systems.
6. Deployment Considerations and Limitations
While generative retrieval exhibits considerable promise, several considerations remain:
- The quality of generation remains bounded by the pretraining and fine-tuning of the underlying encoder–decoder architecture; catastrophic forgetting or domain shift may reduce performance in out-of-domain scenarios.
- Achieving high recall can be challenging in cases where relevant documents are not easily decodable from the model's representation.
- Constrained beam search is typically required during decoding (forcing output to valid docIDs or titles), necessitating a prefix tree or similar structure for efficiency and correctness.
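The prefix-tree constraint mentioned above can be sketched concisely. The snippet below is illustrative, not any library's API: it builds a trie over the token sequences of valid titles and exposes the set of tokens allowed at the next decoding step, which would be used to mask the decoder's vocabulary during beam search. Whitespace tokenization and the `</s>` end marker are simplifying assumptions.

```python
def build_title_trie(titles, eos="</s>"):
    """Prefix tree over tokenized valid titles; constraining decoding to
    paths in this trie guarantees every finished hypothesis is a real title."""
    trie = {}
    for title in titles:
        node = trie
        for tok in title.split() + [eos]:
            node = node.setdefault(tok, {})
    return trie

def allowed_next(trie, prefix_tokens):
    """Vocabulary mask for the next step: tokens that keep the partial
    output a valid title prefix (empty set if the prefix left the trie)."""
    node = trie
    for tok in prefix_tokens:
        if tok not in node:
            return set()
        node = node[tok]
    return set(node.keys())
```

For example, with titles "Barack Obama", "Barack Obama Sr.", and "Paris", after decoding "Barack Obama" the only admissible continuations are "Sr." or the end marker, so the beam can never drift into a nonexistent docID.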
Despite these caveats, the model's flexibility in evidence number, its joint modeling of dependencies, and its reduced computational burden position generative retrieval as a principled alternative to classic index-based methods in scenarios demanding rapid, coordinated evidence assembly.
7. Theoretical and Practical Implications
The generative retrieval process as articulated by GERE (Chen et al., 2022) advances IR methodology in several respects:
- Unifies the evidence retrieval process: no longer staged, but realized as a single generative act, tightly coupled with the reasoning task.
- Demonstrates the feasibility and empirical superiority of indexless, sequence-based retrieval in knowledge-intensive applications.
- Provides a foundation for future work—integrating reinforcement learning, hybrid generative–retrieval architectures, or extending generative retrieval to multi-modal or non-textual corpora.
This paradigm underscores a shift towards holistic, dependency-aware retrieval systems that are both efficient and capable of producing context-sensitive, multi-unit evidential support per query.