Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study

Published 21 May 2026 in cs.IR | (2605.22358v1)

Abstract: While generative retrieval (GR) demonstrates competitive performance on standard retrieval benchmarks, existing approaches directly map queries to document identifiers (docids) without intermediate deliberation, limiting their effectiveness for complex queries that require multi-step reasoning. As a preliminary study on integrating chain-of-thought (CoT) into generative retrieval, we introduce ThinkGR, a unified framework that interleaves CoT with docid generation, enabling iterative thinking and retrieval within a single generative process. To bridge the gap between free-form thought generation and structured retrieval targets, we design (1) a hybrid decoding strategy that dynamically switches between unconstrained thought generation and constrained docid decoding, and (2) a two-phase training approach that first aligns thought-retrieval patterns through supervised fine-tuning, then optimizes thought quality via retrieval-grounded reinforcement learning. Experiments on four multi-hop retrieval benchmarks demonstrate that ThinkGR achieves state-of-the-art performance with an average improvement of +6.86%. Our work opens new avenues for enhancing generative retrieval with explicit deliberation capabilities, with promising implications for retrieval tasks requiring complex reasoning.

Authors (5)

Summary

The paper introduces ThinkGR, which interleaves chain-of-thought reasoning with document retrieval to improve multi-hop query performance.
It employs a semantic triple representation and a two-phase training strategy combining supervised fine-tuning and KTO reinforcement learning.
Experimental results show a 6.86% average recall improvement and 6.37% higher QA accuracy, demonstrating robustness in complex retrieval tasks.

Integrating Chain-of-Thought into Generative Retrieval: Technical Analysis of ThinkGR

Motivation and Context

Recent progress in generative retrieval (GR) has reframed document retrieval as a sequence generation task, directly mapping queries to document identifiers (docids) using LLM architectures. While effective for standard benchmarks, these approaches lack explicit deliberation mechanisms, impeding performance on complex queries requiring multi-step semantic reasoning. For multi-hop retrieval—where resolving a query necessitates traversing interdependent facts scattered across documents—this limitation is critical.

Inspired by advances in chain-of-thought (CoT) prompting and deliberative reasoning in LLMs, this paper proposes ThinkGR: a unified, autoregressive generative retrieval framework that explicitly interleaves reasoning ("thought") with retrieval within a single generation process. This design eliminates the need for separate LLM-retriever modules or multi-stage inference and enables end-to-end optimization of both reasoning and retrieval actions.

ThinkGR Methodology

Semantic Triple Representation for Docids

Unlike traditional GR methods using opaque or purely lexical docids, ThinkGR represents documents as knowledge triples $(\text{head entity}, \text{relation}, \text{tail entity})$. This choice brings two technical advantages: (1) enabling semantic traversal via explicit, interpretable relations, and (2) aligning the autoregressive LLM's generation space with the structure of retrieval targets, facilitating generalization to unseen entities and relations.

The curation pipeline constructs semantic triples using LLM-based extraction, with additional filtering for factual and format correctness, resulting in high-quality SFT data. This semantic, structured representation directly supports multi-hop reasoning, mitigating the semantic gap in complex queries.
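
To make the triple-based docid idea concrete, the following is a minimal Python sketch rather than the authors' pipeline. The `Triple` class, the `to_docid` serialization, and the `is_well_formed` check are hypothetical names introduced for illustration, the example triples are toy data, and the check covers only the format side of the filtering described above (factual filtering is out of scope here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A (head entity, relation, tail entity) knowledge triple."""
    head: str
    relation: str
    tail: str

    def to_docid(self) -> str:
        # Assumed serialization; the exact docid string format is not
        # specified in the summary above.
        return f"({self.head}, {self.relation}, {self.tail})"


def is_well_formed(triple: Triple) -> bool:
    """Format check only: every field must be non-empty after stripping."""
    return all(part.strip() for part in (triple.head, triple.relation, triple.tail))


# Toy extraction output; the third triple has an empty relation and is dropped.
extracted = [
    Triple("Marie Curie", "born in", "Warsaw"),
    Triple("Warsaw", "capital of", "Poland"),
    Triple("Marie Curie", "", "Physics"),
]
docids = [t.to_docid() for t in extracted if is_well_formed(t)]
print(docids)
```

In this toy corpus the tail of the first triple is the head of the second, which is the kind of explicit link a multi-hop query can traverse.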

Two-Phase Training Strategy

Thought-Retrieval Alignment (Supervised Fine-Tuning, SFT): The model is fine-tuned to generate interleaved sequences of thought tokens and docids by minimizing the negative log-likelihood over sequences of the form $(r_1, d_1, r_2, d_2, \ldots)$, where each thought $r_i$ informs and contextualizes the subsequent retrieval $d_i$. This aligns the model's generation patterns with the intended think-then-retrieve workflow.
Retrieval-Grounded Thought Optimization (KTO Reinforcement Learning): Supervised alignment alone teaches imitation of the training sequences and does not by itself yield better reasoning strategies. The authors therefore apply Kahneman-Tversky Optimization (KTO), using retrieval accuracy as an automatic, retrieval-grounded reward. Responses are partitioned into desirable and undesirable sets by a recall-based threshold; the prospect-theoretic KTO objective then updates the policy to reinforce trajectories (thought sequences) that lead to correct retrieval, tying the quality of generated reasoning directly to downstream retrieval utility.
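
The recall-based partition in the second phase can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the `Trajectory` type, the `partition_by_recall` function, the default threshold of 0.5, and the choice of `>=` at the boundary are all hypothetical, and each trajectory is assumed to have been scored with retrieval recall beforehand.

```python
from typing import List, NamedTuple, Tuple


class Trajectory(NamedTuple):
    """One generated thought-retrieval sequence with its precomputed recall."""
    text: str
    recall: float


def partition_by_recall(
    trajectories: List[Trajectory], threshold: float = 0.5
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Split trajectories into (desirable, undesirable) sets.

    The threshold value and the inclusive comparison are assumptions.
    """
    desirable = [t for t in trajectories if t.recall >= threshold]
    undesirable = [t for t in trajectories if t.recall < threshold]
    return desirable, undesirable


# Toy trajectories with made-up recall scores.
samples = [
    Trajectory("thought A ... docids A", 1.0),
    Trajectory("thought B ... docids B", 0.5),
    Trajectory("thought C ... docids C", 0.0),
]
good, bad = partition_by_recall(samples)
print(len(good), len(bad))
```

KTO needs only this binary desirable/undesirable signal per response rather than paired preferences, which is why a simple threshold on an automatic metric is enough to produce its training data.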

Hybrid Decoding: Unified Inference Across Thought and Retrieval

ThinkGR deploys a hybrid decoding strategy that interleaves unconstrained natural language generation (for thoughts) with constrained decoding (for docid tokens). Specifically, when a retrieval action is triggered, decoding is restricted to valid next tokens using an FM-index over semantic triples, guaranteeing only valid docids are produced. This approach balances the flexibility needed for reasoning with the precision of corpus referencing, avoiding docid hallucinations and ensuring generation efficiency.
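
The switch between free and constrained decoding can be illustrated with a small Python sketch. Several things here are assumptions: a plain prefix trie stands in for the FM-index (a trie only supports decoding docids from their first token, whereas an FM-index is a different and more compact structure), the `<ret>` and `</ret>` markers and the `<eos>` token are hypothetical, and `toy_scores` is a scripted stand-in for a real language model.

```python
START, END, EOS = "<ret>", "</ret>", "<eos>"  # hypothetical control tokens


def build_trie(docids):
    """Store each docid token sequence in a nested-dict prefix trie."""
    root = {}
    for tokens in docids:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}  # a complete docid may be closed here
    return root


def allowed_next(trie, prefix):
    """Return the tokens that extend `prefix` toward some docid in the corpus."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()
    return set(node)


def hybrid_decode(next_token_scores, trie, max_steps=50):
    """Greedy decoding: unconstrained for thoughts, trie-masked for docids."""
    output, docid_prefix, constrained = [], [], False
    for _ in range(max_steps):
        scores = next_token_scores(output)
        if constrained:
            valid = sorted(allowed_next(trie, docid_prefix))
            if not valid:
                break
            # Mask: only tokens that keep the docid inside the corpus compete.
            token = max(valid, key=lambda t: scores.get(t, float("-inf")))
            if token == END:
                constrained, docid_prefix = False, []
            else:
                docid_prefix.append(token)
        else:
            token = max(scores, key=scores.get)
            constrained = token == START
        output.append(token)
        if token == EOS:
            break
    return output


corpus = [["Curie", "born_in", "Warsaw"], ["Warsaw", "capital_of", "Poland"]]
trie = build_trie(corpus)

# The scripted "model" prefers the invalid tail "Paris" at the sixth step.
script = ["need", "birthplace", START, "Curie", "born_in", "Paris", END, EOS]
vocab = set(script) | {"Warsaw", "capital_of", "Poland"}


def toy_scores(context):
    scores = {tok: 0.0 for tok in vocab}
    scores[script[len(context)]] = 1.0
    return scores


result = hybrid_decode(toy_scores, trie)
print(result)
```

Although the scripted model scores "Paris" highest, the mask leaves "Warsaw" as the only valid continuation, so the emitted docid is guaranteed to exist in the corpus.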

Experimental Evaluation

Benchmarks and Metrics

ThinkGR is evaluated on four multi-hop retrieval benchmarks: HotpotQA, 2WikiMultihopQA, MuSiQue, and MoreHopQA. These datasets are selected to challenge the system with complex, multi-hop queries often involving new schema, variable corpus sizes, and varying degrees of question-document lexical overlap. Retrieval recall is the primary metric, supplemented by QA accuracy when answers are generated based on retrieved documents.
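
For reference, retrieval recall can be computed as in the sketch below. This is the standard set-based definition and is an assumption about the evaluation: the paper's exact variant (for example, a cutoff on the number of retrieved documents) is not stated here, and the function names and docid strings are illustrative.

```python
def recall(retrieved, gold):
    """Fraction of gold documents that appear among the retrieved ones."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0  # assumed convention for queries without gold documents
    return len(gold_set & set(retrieved)) / len(gold_set)


def mean_recall(runs):
    """Average recall over (retrieved, gold) pairs, one pair per query."""
    return sum(recall(r, g) for r, g in runs) / len(runs)


# Toy example: the first query finds one of two gold documents,
# the second finds its single gold document.
runs = [
    (["d1", "d3"], ["d1", "d2"]),
    (["d4"], ["d4"]),
]
print(mean_recall(runs))
```

For multi-hop queries the gold set contains one document per reasoning hop, so a recall below 1.0 means at least one hop's evidence was missed.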

Main Results and Numerical Findings

ThinkGR achieves an average gain of +6.86% retrieval recall over the strongest baselines across all evaluated datasets.

On complex, out-of-domain data (MoreHopQA), ThinkGR outperforms the leading baseline by 5.68%, illustrating improved robustness and generalization to new schemas.
Ablation studies demonstrate that omitting either the CoT interleaving or the KTO RL phase substantially reduces retrieval accuracy, underscoring the necessity of both explicit thought-retrieval sequences and retrieval-grounded optimization.
In downstream QA, ThinkGR's average accuracy surpasses all baselines by 6.37%, confirming that retrieval improvements translate to superior end-to-end QA.

An important observation is that on HotpotQA, which exhibits high n-gram query-context overlap, dense retrievers relying on lexical similarity remain competitive, but ThinkGR still significantly outperforms all LLM-driven multi-step methods.

Efficiency and Scalability

ThinkGR achieves practical inference efficiency through single-pass autoregressive generation, yielding lower per-query latency than LLM-driven multi-step methods that require iterative LLM-retriever handoffs. The FM-index facilitates constant-time constrained decoding with minimal storage overhead, a critical advantage for large corpora: index storage is reduced by over an order of magnitude (up to 41x) compared to dense retrieval indices.

Theoretical and Practical Implications

Theoretically, ThinkGR provides evidence that explicit interleaving of reasoning and retrieval, jointly optimized end-to-end, yields measurable gains for compositional, multi-hop retrieval. This supports the claim that unified CoT-augmented LLMs can bridge the semantic gap that limits standard GR and dense retrieval.

Practically, the hybrid decoding strategy and RL-based, retrieval-grounded fine-tuning define a robust framework for deployable, high-fidelity retrieval in real-world settings, including those involving unseen schema or evolving corpora. The framework generalizes across backbone architectures and model scales, and it also yields consistent gains in training-free few-shot prompting settings (albeit with a performance gap to fine-tuned variants).

The authors acknowledge several limitations: the triple-based docid design is specific to multi-hop scenarios, optimization strategies are explored only to a limited extent, and further work is needed to generalize to one-hop retrieval tasks and non-relational content.

Prospects and Future Directions

The results suggest several fertile directions:

Advanced Optimization: Incorporation of process-level rewards or curriculum learning to further refine reasoning quality.
Docid Representation: Development of richer, generalizable representations that capture both relational and non-factual content, potentially benefiting a broader class of retrieval tasks.
Generalized Evaluation: Expanding empirical studies to cover a more diverse range of tasks beyond multi-hop retrieval for broader characterization of the benefits and trade-offs in CoT-augmented generative retrieval.

Conclusion

ThinkGR provides substantive evidence that integrating chain-of-thought reasoning within generative retrieval delivers significant gains in complex, compositional retrieval tasks. The proposed hybrid decoding and dual-phase training allow for explicit, interpretable, and efficient reasoning-retrieval integration, validated by strong empirical results across challenging benchmarks. This work positions CoT-augmented generative retrieval as a promising paradigm for future research at the intersection of neural IR and deliberative LLM architectures, with practical implications for the deployment of modular, general-purpose reasoning engines in large-scale information systems.
