Generative Retrieval Paradigm
- The generative retrieval paradigm is an information retrieval approach that employs autoregressive sequence models to directly generate unique document identifiers, integrating indexing, search, and ranking in a single model.
- It utilizes specially designed document identifiers—atomic, structured, or semantic—to map queries to documents, enhancing semantic alignment and retrieval efficiency.
- Empirical studies demonstrate improved recall and speed on benchmarks, while highlighting challenges in scaling, constrained decoding, and optimizing identifier design.
Generative retrieval is a retrieval paradigm in information retrieval (IR) that unifies indexing, search, and ranking within a single generative sequence model, typically an LLM or encoder–decoder architecture. Rather than encoding documents and queries into fixed vectors or relying on an explicit, external index structure, the generative retrieval model learns to memorize document–identifier mappings and, at inference time, autoregressively generates the identifiers (DocIDs) of relevant documents for a given query. This approach replaces the index–retrieve–rank pipeline common to both sparse (BM25) and dense dual-encoder systems with an end-to-end sequence generation task optimized under maximum likelihood or combined ranking objectives (Kuo et al., 3 Jun 2024, Li et al., 23 Apr 2024).
1. Core Formalism and Principle
In the generative retrieval paradigm, every document is assigned a unique identifier, which may be atomic (single token), structured (multi-token), or a natural-language string. Given a user query $q$, the model defines an autoregressive probability over identifier sequences,

$$P(\mathrm{id} \mid q) = \prod_{t=1}^{m} P(a_t \mid a_{<t}, q),$$

where the identifier is represented as a sequence $\mathrm{id} = (a_1, \dots, a_m)$, each token $a_t$ drawn from a finite vocabulary or semantic codebook. The whole retrieval system is trained end-to-end to minimize the (cross-entropy) loss over a corpus of query–document pairs,

$$\mathcal{L} = -\sum_{(q,\, d)} \log P(\mathrm{id}_d \mid q).$$

This framework fully integrates the mapping from queries to relevant document identifiers and allows backpropagation and learning to take place directly through the retrieval pipeline (Huang et al., 9 Oct 2025, Kuo et al., 3 Jun 2024).
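As a minimal sketch of this objective (assuming a PyTorch-style model that exposes per-position logits over the identifier vocabulary; the helper `docid_nll` and the tensor shapes are illustrative, not taken from the cited papers):

```python
import torch
import torch.nn.functional as F

def docid_nll(logits, docid_tokens):
    """Negative log-likelihood -log P(id_d | q) of a gold DocID sequence.

    logits:       (m, V) scores for the m identifier positions, produced by the
                  generative model conditioned on the query and previous tokens.
    docid_tokens: (m,) gold identifier tokens for the relevant document.
    """
    log_probs = F.log_softmax(logits, dim=-1)                   # normalize over the codebook
    token_ll = log_probs.gather(1, docid_tokens[:, None]).squeeze(1)
    return -token_ll.sum()                                      # sum over identifier positions

# Toy usage: a 3-token DocID over a 1000-atom codebook.
loss = docid_nll(torch.randn(3, 1000), torch.tensor([17, 402, 311]))
```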
2. Document Identifier Design and Semantic Compression
Central to generative retrieval is the scheme for document identifiers. Identifiers can be:
- Atomic / Numeric: Fixed-length, single-token codes (e.g., one special token per document), enabling direct lookup and efficient ranking via inner products (Nguyen et al., 2023).
- Structured Sequence: Sequences of cluster or path tokens, such as those from hierarchical clustering or learned codebooks (e.g., cluster IDs in a K-ary tree) (Nguyen et al., 2023, Sun et al., 2023).
- Natural-Language or Semantic: Title strings, sets of key n-grams, term sets, URLs, or pseudo-queries that facilitate semantic alignment between query space and document space (Kuo et al., 3 Jun 2024, Zhang et al., 2023, Lee et al., 2023).
To address identifier misalignment and inflation, especially in cross-lingual scenarios, approaches like cross-lingual semantic compression are introduced (Huang et al., 9 Oct 2025). In MGR-CSC, keywords are extracted across languages, embedded, and clustered to form shared semantic “atoms.” Each document’s keywords are then mapped to a fixed-length atom sequence

$$\mathrm{id}(d) = (a_1, \dots, a_m),$$

where each $a_j$ is the shared atom (cluster) assigned to the document’s $j$-th keyword. Clustering compresses the identifier space (e.g., a 74–78% reduction in token length on multilingual corpora), aligns semantics, and enables model sharing across languages.
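A minimal sketch of this atom construction, assuming precomputed multilingual keyword embeddings and k-means clustering; the helper names `build_atoms` and `docid_from_keywords` are illustrative, not MGR-CSC's actual interface:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_atoms(keyword_embeddings, n_atoms):
    """Cluster keyword embeddings (from any multilingual encoder) into shared semantic atoms."""
    return KMeans(n_clusters=n_atoms, n_init=10, random_state=0).fit(keyword_embeddings)

def docid_from_keywords(atoms, doc_keyword_embeddings, m):
    """Map a document's first m keyword embeddings to a fixed-length sequence of atom IDs."""
    assignments = atoms.predict(doc_keyword_embeddings[:m])   # one atom per keyword
    return tuple(int(a) for a in assignments)

# Toy usage: 500 keyword vectors of dimension 64 compressed into 128 atoms,
# then a length-4 DocID for one document.
rng = np.random.default_rng(0)
atoms = build_atoms(rng.normal(size=(500, 64)), n_atoms=128)
doc_id = docid_from_keywords(atoms, rng.normal(size=(10, 64)), m=4)
```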
3. Decoding and Retrieval Algorithms
Retrieval in generative paradigms is driven by constrained autoregressive decoding:
- Prefix-tree/Trie constraints: Only valid identifier prefixes are allowed at each decoding step, typically enforced via beam search restricted to the valid DocID space (Sun et al., 2023, Huang et al., 9 Oct 2025).
- Dynamic candidate narrowing: At each decoding step, the set of allowed atoms is restricted based on the current DocID prefix, drastically reducing the search space and GPU softmax overhead (e.g., roughly 90% pruning at each step in MGR-CSC).
- Permutation-invariant decoding: For set-based identifiers (e.g., term sets in TSGen), decoding is invariant to the order of terms, enabling resilience against pruning errors: any permutation of the valid terms reconstructs the same document (Zhang et al., 2023).
Typical pseudocode for dynamic constrained multi-step decoding:

```python
def constrained_decode(model, q, m):
    """Greedily decode an m-atom DocID, narrowing the candidate set at each step."""
    prefix = []
    for t in range(m):
        A_t = valid_atoms_given_prefix(prefix)            # atoms allowed after the current prefix
        probs = model.predict_next_token(prefix, q, A_t)  # score only the allowed atoms
        a_t = argmax(probs)                               # greedy choice; beam search in practice
        prefix.append(a_t)
    return prefix
```
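The `valid_atoms_given_prefix` step is typically backed by a prefix tree (trie) built over all corpus DocIDs. The sketch below shows one minimal way to implement it with nested dictionaries (passing the trie explicitly rather than closing over it); this is an illustrative assumption, not the exact structure used in the cited systems.

```python
def build_docid_trie(docids):
    """Build a nested-dict prefix tree from an iterable of DocID tuples."""
    trie = {}
    for docid in docids:
        node = trie
        for atom in docid:
            node = node.setdefault(atom, {})
    return trie

def valid_atoms_given_prefix(trie, prefix):
    """Return the atoms that extend `prefix` toward at least one valid DocID."""
    node = trie
    for atom in prefix:
        node = node.get(atom)
        if node is None:
            return set()              # prefix does not occur in the corpus
    return set(node.keys())

# Toy usage with three 3-atom DocIDs.
trie = build_docid_trie([(1, 4, 9), (1, 4, 2), (3, 7, 7)])
assert valid_atoms_given_prefix(trie, (1, 4)) == {9, 2}
```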
4. Model Architectures and Training Regimes
Generative retrieval is instantiated on various pretrained architectures:
- Encoder–Decoder Transformer models (T5, BART): Tokenize queries and output identifier sequences (Kuo et al., 3 Jun 2024, Sun et al., 2023).
- Decoder-only LLMs (e.g., Llama) and multilingual encoder–decoder models (e.g., mT5): Used in multilingual or purely autoregressive setups (Huang et al., 9 Oct 2025).
- Multimodal models: For cross-modal retrieval (as in GRACE), image encoders are paired with generative LLMs to map images to identifiers and “memorize” visual corpora (Li et al., 16 Feb 2024).
Training involves two primary objectives, which are often combined:
- Indexing: Learn document-to-identifier mappings from document text (or visual content) to identifier space.
- Retrieval (main): Learn query-to-identifier mappings, so that given a query, the relevant DocID is directly generated.
- Joint or multi-task objectives: Combine both tasks (or additional pseudo-query generation, ranking, or alignment objectives) for end-to-end optimization (Sun et al., 2023, Pang et al., 2 Apr 2025, Lee et al., 2023).
Augmentation techniques—pseudo-query generation, coverage-promoting identifier augmentation, and co-training query and document encoders—are critical for robust generalization and for scaling to long or structurally complex items such as books (Tang et al., 19 Jan 2025).
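A schematic of how these objectives are often mixed into a single seq2seq training stream; the data layout below (plain `(input_text, target_docid)` pairs) is a simplifying assumption, not the exact recipe of any cited system:

```python
import random

def build_training_examples(corpus, qrels, pseudo_queries):
    """Mix indexing, retrieval, and pseudo-query examples for joint training.

    corpus:         dict of docid -> document text
    qrels:          list of (query, docid) relevance pairs
    pseudo_queries: dict of docid -> list of generated pseudo-queries
    Returns (input_text, target_docid) pairs for a seq2seq generative retriever.
    """
    examples = []
    for docid, text in corpus.items():
        examples.append((text, docid))                 # indexing: document -> DocID
        for pq in pseudo_queries.get(docid, []):
            examples.append((pq, docid))               # augmentation: pseudo-query -> DocID
    for query, docid in qrels:
        examples.append((query, docid))                # retrieval: real query -> DocID
    random.shuffle(examples)
    return examples
```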
5. Empirical Performance and Analysis
Generative retrieval achieves competitive or state-of-the-art results on large-scale IR benchmarks such as MS MARCO, Natural Questions, multilingual passage retrieval, e-commerce search, and cross-modal image retrieval. Key empirical findings include:
- Recall and ranking improvements: Notable uplift on Recall@1 and Recall@10 (e.g., +6.8% Recall@1 on mMarco100k for MGR-CSC; R4R delivers +2.2–2.5% Hits@1 across several datasets) (Huang et al., 9 Oct 2025, Zhang et al., 15 Oct 2025).
- Identifier compression: Cross-lingual semantic clustering reduces DocID token length by >74% without loss of accuracy (Huang et al., 9 Oct 2025).
- Efficiency gains: Dynamic constraint decoding and fixed-length identifiers can yield 2x speedup over vanilla sequence-to-sequence retrieval, along with 4x GPU memory reduction due to softmax size contraction (Huang et al., 9 Oct 2025).
- Ablation results: Key modules such as semantic compression, dynamic decoding, and ranking objectives are all essential; removing them degrades Recall@10 by 9–16 points (MGR-CSC), or Hits@1 by over 20 points when structured context is omitted in reasoning-augmented retrieval (Huang et al., 9 Oct 2025, Zhang et al., 15 Oct 2025).
- Cross-modal scalability: Generative cross-modal retrieval (GRACE) surpasses CLIP on large image corpora (above 150K images), since per-query generation cost is roughly constant in corpus size; atomic code identifiers work best, though they enlarge the vocabulary (Li et al., 16 Feb 2024).
6. Extensions: Reasoning, Multilinguality, and Application Domains
Recent work highlights several extensions and challenges:
- Reasoning-augmented generative retrieval: R4R introduces explicit, structured reasoning (context/explanation pairs) and retrieval–refinement loops, boosting retrieval accuracy and interpretability over vanilla chain-of-thought methods (Zhang et al., 15 Oct 2025).
- Multilingual generalization: Cross-lingual misalignments are mitigated with shared semantic “atom” identifiers and compression (MGR-CSC), enabling alignment and efficiency at multilingual scale (Huang et al., 9 Oct 2025).
- Cross-modal retrieval: Generative frameworks such as GRACE and GenIR (for mental image retrieval) extend the paradigm to multimodal corpora, with robust annotation and iterative, visually grounded retrieval (Li et al., 16 Feb 2024, Yang et al., 6 Jun 2025).
- E-commerce and structured domains: GRAM and GenR-PO exploit structured, multi-attribute or multi-span identifiers, preference optimization with human click logs, and constrained beam search via FM-index structures for interpretable and high-precision retrieval at scale (Pang et al., 2 Apr 2025, Li et al., 29 Jul 2024).
7. Theoretical Foundations, Limitations, and Future Directions
Analytical and empirical work identifies both advantages and limitations:
- Representational capacity: GR is globally normalized and calibrated and, given sufficient model capacity, can approximate arbitrary relevance distributions, in contrast to the local normalization and rank bottleneck of dense retrieval; see the sketch after this list (Zhang et al., 26 Sep 2025).
- Scalability bottlenecks: Model performance degrades as corpus size grows unless identifier space and parameter count are increased in tandem; dynamic corpora require retraining or sophisticated incremental learning (Kuo et al., 3 Jun 2024).
- Decoding challenges: Constrained beam search can be suboptimal, particularly in generalization to unseen corpora; KL-divergence lower bounds demonstrate inevitable error from constraint unawareness, with compounded recall loss in large, diverse document collections (Wu et al., 14 Apr 2025).
- Optimization remedies: Hybrid approaches (e.g., GDR combines generative coarse-grained matching with dense fine-grained retrieval), ranking and distillation objectives (LTRGR, DGR), and advanced DocID design/learned tokenization (GenRet, GLEN) drive future progress (Yuan et al., 19 Jan 2024, Li et al., 2023, Lee et al., 2023, Sun et al., 2023, Pang et al., 2 Apr 2025).
- Research directions: Improved scalability, identifier learning, continual and retrieval-aware pretraining, reasoning integration, unified retrieval+generation modeling, and efficient, interpretable decoding rank as central open challenges (Kuo et al., 3 Jun 2024, Li et al., 23 Apr 2024, Zhang et al., 15 Oct 2025).
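To make the rank-bottleneck contrast concrete, a schematic comparison (the notation here is assumed for illustration, not taken verbatim from the cited analysis): a dense dual-encoder scores documents by an inner product of $k$-dimensional embeddings,

$$s_{\text{dense}}(q, d) = \langle E_q(q),\, E_d(d) \rangle, \qquad \operatorname{rank}\big[\, s_{\text{dense}}(q_i, d_j) \,\big]_{i,j} \le k,$$

so the achievable query–document score matrix has rank at most $k$, whereas generative retrieval defines a distribution over the full DocID space,

$$P_{\text{GR}}(d \mid q) = \prod_{t=1}^{m} P(a_t \mid a_{<t}, q), \qquad \sum_{d' \in \mathcal{D}} P_{\text{GR}}(d' \mid q) = 1,$$

which is globally normalized (under decoding constrained to valid DocIDs) and is not limited by a fixed embedding rank.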
References:
- (Huang et al., 9 Oct 2025) Multilingual Generative Retrieval via Cross-lingual Semantic Compression
- (Zhang et al., 15 Oct 2025) Retrieval-in-the-Chain: Bootstrapping LLMs for Generative Retrieval
- (Li et al., 16 Feb 2024) Generative Cross-Modal Retrieval: Memorizing Images in Multimodal LLMs for Retrieval and Beyond
- (Pang et al., 2 Apr 2025) Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval
- (Sun et al., 2023) Learning to Tokenize for Generative Retrieval
- (Kuo et al., 3 Jun 2024) A Survey of Generative Information Retrieval
- (Nguyen et al., 2023) Generative Retrieval as Dense Retrieval
- (Wu et al., 14 Apr 2025) Constrained Auto-Regressive Decoding Constrains Generative Retrieval
- (Li et al., 2023) Learning to Rank in Generative Retrieval
- (Zhang et al., 2023) Generative Retrieval via Term Set Generation
- (Lee et al., 2023) GLEN: Generative Retrieval via Lexical Index Learning
- (Tang et al., 19 Jan 2025) Generative Retrieval for Book search
- (Yang et al., 6 Jun 2025) GenIR: Generative Visual Feedback for Mental Image Retrieval
- (Li et al., 29 Jul 2024) Generative Retrieval with Preference Optimization for E-commerce Search
- (Li et al., 23 Apr 2024) From Matching to Generation: A Survey on Generative Information Retrieval
- (Reusch et al., 25 Mar 2025) Reverse-Engineering the Retrieval Process in GenIR Models