
Generative Relevance Feedback

Updated 26 November 2025
  • Generative Relevance Feedback is a retrieval technique that uses synthetic feedback from LLMs to improve query representations and bridge vocabulary gaps.
  • It combines query expansion and embedding-based query reformulation, yielding up to 19% improvement in nDCG and MAP over traditional pseudo-relevance feedback.
  • GRF is applied in sparse, dense, and multimodal IR settings, offering enhanced recall and precision for challenging and ambiguous queries.

Generative Relevance Feedback (GRF) is a family of retrieval enhancement techniques in which feedback signals are constructed and integrated using outputs from generative models—primarily LLMs—rather than directly from documents retrieved by the initial query. GRF encompasses both query expansion (generating new terms, phrases, or entire queries) and embedding-based query reformulation using synthetic, query-aligned content. In modern information retrieval (IR), GRF is applied in both sparse and dense retrieval paradigms, as well as in multimodal tasks such as text-to-image retrieval via vision-LLMs (VLMs). Empirical studies on GRF show robust improvements (typically 3–19% on nDCG/MAP) over classical pseudo-relevance feedback (PRF), especially on “hard” queries, and establish new state-of-the-art (SOTA) results across diverse IR benchmarks (Mackie et al., 2023, Wang et al., 2023, Dhole et al., 27 May 2024, Khaertdinov et al., 21 Nov 2025).

1. Motivation and Core Concepts

Pseudo-relevance feedback (PRF) and its dense analogs (vector PRF) traditionally operate by assuming that the top-k documents retrieved for a user query q are mostly relevant. Feedback signals are then derived by estimating term or embedding distributions from these documents. However, this approach is limited by the relevance assumption (quality of the top-k) and the model assumption (feedback tied to the retriever architecture) (Tu et al., 29 Oct 2025). Moreover, PRF often struggles on low-recall queries or with ambiguous intents, as spurious documents can “drift” the reformulated query away from the intended information need.

GRF replaces or supplements the PRF set with synthetic content generated by an LLM, either from the query alone (zero-shot) or with additional context (pseudo- or explicit relevance feedback). This generative content can supply new expansion terms, full query reformulations, or synthetic documents whose embeddings steer the query representation.

GRF’s defining characteristic is its decoupling of feedback from the initial retrieval corpus, instead leveraging LLMs’ world knowledge and generation capacity to enhance recall, bridge vocabulary mismatch, and offer more interpretable expansion signals.

2. GRF Methodologies and Algorithmic Approaches

2.1. Generative Query Expansion and Reformulation

Several canonical GRF algorithms have emerged:

  • Zero-shot Generation: Prompt an LLM with the original query q to produce synthetic expansions or reformulations (e.g., keywords, entities, alternate queries, long-form passages). The feedback model P(w | D_{LLM}) is then estimated over all generated text D_{LLM}; this model is interpolated with the original query for the final expanded representation (Mackie et al., 2023, Mackie et al., 2023).
  • Context-driven GRF (GenPRF and Post-Retrieval GRF): Provide the LLM with the query plus a set of top-k retrieved documents or passages as additional context, conditioning expansions on both user need and corpus evidence (Wang et al., 2023, Dhole et al., 27 May 2024, Parry et al., 2 May 2024). This mechanism supports both single-turn (one reformulation per query) and iterative feedback loops.
  • GRF for Vision-LLMs: In text-to-image retrieval, GRF uses a captioning model (e.g., LLaVA-1.5) to generate synthetic captions for top-ranked images. These captions, encoded as text embeddings, serve as positive feedback vectors for iterative query refinement via a Rocchio-style update (Khaertdinov et al., 21 Nov 2025).
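As a concrete illustration of the zero-shot variant, the feedback model P(w | D_LLM) can be estimated by maximum-likelihood counting over the generated text. This is a minimal sketch: the actual LLM call is stubbed with canned expansions, and the toy stopword list stands in for a real one.

```python
from collections import Counter

def feedback_term_distribution(generated_docs, stopwords=frozenset({"the", "a", "of"})):
    """Estimate P(w | D_LLM) by maximum likelihood over all LLM-generated text."""
    counts = Counter(
        tok
        for doc in generated_docs
        for tok in doc.lower().split()
        if tok not in stopwords
    )
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Stand-in for a real zero-shot LLM call (hypothetical canned expansions):
generated = [
    "generative feedback expands queries with synthetic passages",
    "synthetic passages bridge vocabulary mismatch in retrieval",
]
p_w = feedback_term_distribution(generated)
```

The resulting distribution is then interpolated with the original query model, as formalized in Section 2.3.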

2.2. Relevance-Aware Weighting

Raw LLM-generated content may contain off-topic (“hallucinated”) information. To address this, several GRF variants use relevance-aware weighting:

  • RASE (Relevance-Aware Sample Estimation): Each generated document is matched to real corpus documents, scored by their similarity and downstream retrieval utility (e.g., MonoT5 ranks), and weighted accordingly in the feedback model (Mackie et al., 2023).
  • Utility-Oriented RL (Generalized PRF): The LLM is trained to generate rewrites that maximize retrieval gain (e.g., difference in nDCG@10 over the baseline), using reinforcement learning to optimize for high-utility reformulations, thus reducing the impact of noisy feedback or hallucinations (Tu et al., 29 Oct 2025).
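A minimal sketch of RASE-style weighting, substituting a bag-of-words cosine for the MonoT5-based scoring used in the cited work (an assumption made here purely for illustration):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rase_weights(generated_docs, corpus_docs):
    """Weight each generated doc by its best match against real corpus docs,
    then normalize so the weights sum to one."""
    corpus_vecs = [Counter(d.lower().split()) for d in corpus_docs]
    weights = []
    for g in generated_docs:
        gv = Counter(g.lower().split())
        weights.append(max(cosine(gv, cv) for cv in corpus_vecs))
    total = sum(weights) or 1.0
    return [w / total for w in weights]
```

Off-topic generations that match nothing in the corpus receive near-zero weight, which is the mechanism by which hallucinated content is suppressed in the feedback model.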

2.3. Core Equations

A representative GRF expansion in the RM3-style for sparse retrieval is:

P_{GRF}(w \mid R) = \beta \, P(w \mid Q) + (1 - \beta) \, P(w \mid D_{LLM}),

where Q is the query, D_{LLM} is the generated synthetic corpus, and β tunes the original-query weight (Mackie et al., 2023). For dense (embedding) or VLM settings, Rocchio-style updates blend query and generated feedback vectors:

\vec{q}_{GRF} = \alpha \, \vec{q} + \beta \, \bar{d},

with d̄ the mean embedded synthetic feedback vector and α, β controlling the weighting (Mackie et al., 2023, Khaertdinov et al., 21 Nov 2025).
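Both updates are a few lines of code. The sketch below assumes the term distributions and embeddings have already been computed:

```python
import numpy as np

def grf_rm3(p_query: dict, p_llm: dict, beta: float = 0.5) -> dict:
    """Sparse GRF: P_GRF(w|R) = beta * P(w|Q) + (1 - beta) * P(w|D_LLM)."""
    vocab = set(p_query) | set(p_llm)
    return {w: beta * p_query.get(w, 0.0) + (1 - beta) * p_llm.get(w, 0.0)
            for w in vocab}

def grf_rocchio(q: np.ndarray, feedback_vecs: np.ndarray,
                alpha: float = 1.0, beta: float = 0.5) -> np.ndarray:
    """Dense GRF: q_GRF = alpha * q + beta * mean(feedback embeddings)."""
    return alpha * q + beta * feedback_vecs.mean(axis=0)
```

Note that if both input distributions are normalized, `grf_rm3` returns a normalized distribution for any β, so the output can be used directly as a relevance model.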

3. Empirical Results and Benchmarks

3.1. Text Retrieval (Sparse, Dense, and Hybrid)

Key findings from large-scale evaluations include:

| Dataset (Metric) | RM3/Pseudo-PRF | GRF | Relative Gain | Reference |
| --- | --- | --- | --- | --- |
| Robust04 (nDCG@10) | 0.451 | 0.528 | +17.1% | (Mackie et al., 2023) |
| CODEC (MAP) | 0.239 | 0.285 | +19.2% | (Mackie et al., 2023) |
| DL-19 (MAP) | 0.383 | 0.441 | +15.1% | (Mackie et al., 2023) |
| Robust04 (R@1k) | 0.777 | 0.788 | +1.4% | (Mackie et al., 2023) |

GRF outperforms PRF and neural PRF methods (e.g., CEQE, TCT+RM3, BERT-QE) in both precision-oriented and recall-oriented metrics, with the largest improvements on “hard” queries (defined by low first-pass nDCG). These results are robust across retriever classes (BM25, SPLADE, ColBERT), and hold after fusion with classic PRF signals (Mackie et al., 2023).

3.2. Vision-Language and Multimodal Retrieval

In VLM-based text-to-image tasks, GRF delivers consistent performance boosts (MRR@5 +3–5 percentage points for smaller models, +1–3 points for larger models) relative to both no-feedback and PRF baselines (Khaertdinov et al., 21 Nov 2025). For instance, with CLIP-ViT-B/32, baseline MRR@5 is 0.758 vs. 0.789 with GRF. Explicitly annotated or attentively aggregated feedback (e.g., AFS) can further enhance or stabilize gains.

3.3. Query Reformulation and Ensemble Methods

Ensemble GRF strategies—using multiple query paraphrases/prompts and fusion of expansion signals—achieve relative gains up to +18% in nDCG@10 (pre-retrieval) and +9% (post-retrieval) over previous SOTA on robust and benchmark datasets (Dhole et al., 27 May 2024). Oracle feedback reveals substantial remaining headroom (e.g., nDCG@10 reaching 0.820 vs. 0.576 in best LLM-only runs for TREC-19).

4. Theoretical Analysis and Comparative Insights

4.1. Robustness to Noisy or Off-Topic Feedback

By decoupling feedback construction from corpus dependency, GRF reduces susceptibility to the failure modes of PRF, especially on queries for which the top-k retrieved documents lack coverage or are out of domain (Mackie et al., 2023, Mackie et al., 2023). However, caveats persist—LLMs can hallucinate plausible but irrelevant expansions, motivating the use of utility-based LLM fine-tuning, ensemble filtering, or feedback fusion.

4.2. Complementarity of GRF and PRF

Empirically, PRF and GRF capture complementary signals: PRF provides corpus-grounded on-topic expansion, while GRF introduces world knowledge and alternative phrasing/facts. Weighted fusion (e.g., WRRF) yields consistently higher recall (+6–9%) and competitive precision compared to either signal alone (Mackie et al., 2023).
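A weighted reciprocal-rank fusion sketch of how PRF- and GRF-derived rankings might be combined (the cited WRRF formulation may differ in detail; k = 60 is the conventional RRF constant):

```python
def wrrf(rankings, weights, k=60):
    """Weighted reciprocal-rank fusion of several ranked document lists.

    score(d) = sum_i  w_i / (k + rank_i(d))  over the lists containing d.
    """
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a PRF run and a GRF run with equal weight:
fused = wrrf([["d1", "d2", "d3"], ["d2", "d4", "d1"]], [0.5, 0.5])
```

Documents appearing in only one list still receive a score, which is how fusion preserves the recall contributions of each signal.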

4.3. Modality Alignment and Feedback Efficacy

In multimodal contexts, GRF’s success is attributed to bridging modality gaps (e.g., converting image embeddings into semantically rich textual captions), which align better with text queries in joint embedding spaces (Khaertdinov et al., 21 Nov 2025). In code generation, interpolating model and feedback vocabulary distributions during decoding increases BLEU scores relative to naïve retrieval-based augmentation (Gemmell et al., 2020).

5. Implementation, Complexity, and Practical Recommendations

5.1. Generation Subtasks and Prompting

Tasks studied include:

  • Short-form: keyword/entity listing, alternative queries
  • Chain-of-thought (CoT): rationale generation for terms/entities
  • Long-form: facts, simulated documents/essays/news (as natural expansion contexts)

Combining multiple subtasks or prompt variants generally yields higher recall and robustness. Temperature, top-p, and prompt templates are tuned per task (Mackie et al., 2023, Dhole et al., 27 May 2024).
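Illustrative prompt templates for these subtasks (the wording below is our own paraphrase, not taken from the cited papers):

```python
# Hypothetical templates for the short-form, CoT, and long-form subtasks.
PROMPTS = {
    "keywords": "List 20 keywords relevant to the query: {query}",
    "entities": "List entities related to the query: {query}",
    "cot": "Think step by step about what '{query}' asks for, then list useful expansion terms.",
    "essay": "Write a short essay answering the query: {query}",
}

def build_prompts(query: str) -> dict:
    """Instantiate every subtask template for one query."""
    return {task: tpl.format(query=query) for task, tpl in PROMPTS.items()}
```

Running several templates per query and pooling the generations is what the multi-subtask ensembles described above amount to in practice.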

5.2. Feedback Document Selection

Post-retrieval GRF leverages passage selection strategies (FirstP, TopP, MaxP) to construct context inputs for the LLM; pre-retrieval approaches are purely zero-shot. Fusion, re-ranking, and adaptive feedback (oracle, critic LLM) are deployed to maximize downstream ranking metrics (Dhole et al., 27 May 2024).
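The three passage-selection strategies can be sketched as follows (our interpretation of the names; details in the cited work may differ):

```python
def select_passages(docs, strategy="FirstP", top_n=3):
    """Pick passages to feed the LLM as context.

    docs: list of (doc_id, [(passage_text, score), ...]),
    with passages in document order.
    """
    if strategy == "FirstP":      # first passage of each document
        picked = [(d, ps[0]) for d, ps in docs]
    elif strategy == "MaxP":      # highest-scoring passage per document
        picked = [(d, max(ps, key=lambda p: p[1])) for d, ps in docs]
    elif strategy == "TopP":      # globally highest-scoring passages
        flat = [(d, p) for d, ps in docs for p in ps]
        picked = sorted(flat, key=lambda x: x[1][1], reverse=True)
    else:
        raise ValueError(strategy)
    return [(d, text) for d, (text, _) in picked[:top_n]]
```

The selected passages are concatenated into the LLM prompt; the strategy choice trades positional priors (FirstP) against retriever confidence (MaxP, TopP).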

5.3. Cost and Efficiency

GRF systems incur higher computational overhead than PRF due to LLM inference, especially when multiple prompts, passages, or ensemble variants are employed. Pre-computation (offline captioning, document expansion) or few-shot/adaptive prompt designs are potential mitigations (Mackie et al., 2023, Dhole et al., 27 May 2024).

6. Limitations, Challenges, and Future Research

Principal limitations of GRF include:

  • Hallucination and domain mismatch: LLMs may produce factually inaccurate or off-corpus expansions, risking query drift or loss of precision (Tu et al., 29 Oct 2025, Mackie et al., 2023).
  • Feedback dependency: Contextual GRF may still falter when the feedback set is of low quality; utility-based objective functions and robust passage selection are active research areas.
  • Computational costs: Multiple LLM calls per query increase resource requirements for production systems.
  • Generalization: Cross-domain robustness is limited by the underlying LLM and the alignment between training and deployment corpora (Tu et al., 29 Oct 2025).

Recent work on Generalized Pseudo-Relevance Feedback (GPRF) uses reinforcement learning to train LLMs for utility-maximizing query reformulation, simultaneously relaxing model and relevance assumptions and obtaining up to 40% relative NDCG@10 gains over strong baselines (Tu et al., 29 Oct 2025).
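The utility signal at the heart of this objective is simple to state. Below is a minimal sketch assuming graded relevance labels in ranked order; GPRF's full training pipeline (SFT plus RL) is of course more involved:

```python
import math

def dcg(rels):
    """Discounted cumulative gain over relevance labels in rank order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_10(rels):
    ideal = dcg(sorted(rels, reverse=True)[:10])
    return dcg(rels[:10]) / ideal if ideal else 0.0

def utility_reward(rels_reformulated, rels_baseline):
    """RL reward = retrieval gain of the rewrite over the original query."""
    return ndcg_at_10(rels_reformulated) - ndcg_at_10(rels_baseline)
```

A rewrite that moves a relevant document from rank 3 to rank 1 earns a positive reward; a rewrite that drifts off topic earns a negative one, directly penalizing hallucinated reformulations.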

7. Summary Table: GRF Paradigms and Applications

| Task/Paradigm | Feedback Source | Main Mechanism | Key Gains/Uses | References |
| --- | --- | --- | --- | --- |
| Sparse IR | LLM-generated text | RM3-style expansion | +5–19% MAP, +17–24% NDCG@10 | (Mackie et al., 2023) |
| Dense IR | LLM embeddings | Vector-Rocchio update | +2–13% R@1000 | (Mackie et al., 2023) |
| VLM Retrieval | Synthetic captions | Text embedding fusion | +1–5 pp MRR@5 | (Khaertdinov et al., 21 Nov 2025) |
| Ensemble QR | Multiple LLM prompts | Query fusion, rank fusion | +9–18% nDCG@10 | (Dhole et al., 27 May 2024) |
| Code Generation | Retrieved code as feedback | Interpolative decoding | +1–2 BLEU | (Gemmell et al., 2020) |
| RL-enhanced GRF | Context + reward feedback | SFT+RL, utility-based | +40% NDCG@10 | (Tu et al., 29 Oct 2025) |

GRF now constitutes a cornerstone technique for robust, highly generalizable query expansion and reformulation in modern IR systems, spanning sparse, dense, learned sparse, and multimodal domains.
