- The paper introduces a utility-centric retrieval framework that shifts focus from topical relevance to the evidence’s impact on LLM-generated responses.
- It differentiates between LLM-agnostic and LLM-specific utility, outlining how proxy metrics and model-specific tailoring improve retrieval performance.
- Empirical results demonstrate that context-dependent utility modeling significantly boosts multi-hop reasoning and answer accuracy in RAG pipelines.
Beyond Relevance: Utility-Centric Retrieval in the LLM Era
Topical relevance has long been the core objective in information retrieval (IR), but the proliferation of LLMs and the adoption of retrieval-augmented generation (RAG) architectures have made it insufficient on its own. Whereas classical IR paradigms assume retrieval systems serve human users directly by presenting relevant documents, RAG fundamentally alters this pipeline: retrieved documents are now intermediate evidence consumed by LLMs, which in turn synthesize responses. Because the consumer of the retrieval results (the LLM) is no longer the party that judges the final output, new retrieval objectives are needed, namely objectives centered on utility (whether retrieval improves downstream generation performance) rather than mere topical alignment.
Classical Notions: From Topical Relevance to User-Centric Utility
Historically, retrieval effectiveness has been measured with metrics such as nDCG, MAP, and MRR, which depend on manually or semi-automatically labeled topical relevance. However, extensive literature in IR and library science distinguishes between relevance and utility: the latter emphasizes the real-world usefulness of retrieved content for end-user goals and is often inferred from implicit behavioral proxies such as clicks, dwell time, or purchases. Even so, the practical difficulty of estimating utility automatically has relegated direct utility optimization to a secondary concern, with topical relevance remaining the dominant optimization proxy.
Recent developments in the web search and recommender systems domains have advanced utility-aware retrieval through holistic and interaction-aware models, optimizing for list-level outcomes and capturing session-level user satisfaction and task success. However, even these advances presume direct human evaluation of retrieval outcomes.
Emergence of LLM-Centric Utility and Its Taxonomy
The RAG paradigm introduces a critical shift: retrieval outcomes are now evaluated in terms of their contribution to LLM generation quality, rather than their direct usefulness to humans. This distinction gives rise to LLM-centric utility, with two orthogonal axes:
LLM-Agnostic vs. LLM-Specific Utility
- LLM-Agnostic Utility: Assumes that a piece of evidence has intrinsic utility that holds across LLMs. Estimation typically relies on proxy metrics (e.g., BLEU, ROUGE, EM, F1 relative to ground-truth answers), answer likelihoods conditioned on evidence, and the difference in generation performance with and without each retrieved document. This is efficient and generalizable across models but neglects differences in model competence, internal knowledge, and reasoning strategies.
- LLM-Specific Utility: Recognizes variance in evidence usefulness across LLMs due to discrepancies in pretraining data, reasoning, and inductive biases. Utility labeling and retriever optimization become tailored to specific target LLMs, resulting in higher alignment with model-specific generation but reduced transferability across architectures. Empirical studies demonstrate the non-triviality of such LLM dependencies [zhang2025llm].
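The "with or without each retrieved document" estimate described above can be sketched as a performance differential. This is a minimal illustration, not the paper's implementation: the `Generator` interface, the two toy generators, and the choice of token-level F1 as the proxy metric are all assumptions made for the example. Running the same document through two different generators also shows why utility can be LLM-specific.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical generator interface: (question, evidence documents) -> answer text.
Generator = Callable[[str, List[str]], str]


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and the ground-truth answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def document_utility(generate: Generator, question: str, doc: str, reference: str) -> float:
    """Utility of one document: answer quality with it minus answer quality without it."""
    with_doc = token_f1(generate(question, [doc]), reference)
    without_doc = token_f1(generate(question, []), reference)
    return with_doc - without_doc


def needs_evidence(question: str, docs: List[str]) -> str:
    """Toy stand-in for an LLM that only answers correctly when the evidence says so."""
    for doc in docs:
        if "paris" in doc.lower():
            return "Paris"
    return "unknown"


def already_knows(question: str, docs: List[str]) -> str:
    """Toy stand-in for an LLM whose internal knowledge already covers the answer."""
    return "Paris"


question, reference = "What is the capital of France?", "Paris"
doc = "Paris is the capital of France."
utility_for_weak_model = document_utility(needs_evidence, question, doc, reference)
utility_for_strong_model = document_utility(already_knows, question, doc, reference)
```

Under these toy assumptions the same document is highly useful to the generator that lacks the fact and adds nothing for the generator that already has it, which is the model dependence that LLM-specific utility is meant to capture.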
Context-Independent vs. Context-Dependent Utility
- Context-Independent Utility: Per-document utility estimated in isolation. This assumption simplifies annotation and supervision pipelines but fails to capture redundancy, complementarity, or antagonism between evidence items.
- Context-Dependent Utility: The utility of a document is dependent on the set of co-retrieved evidence, with explicit modeling of interactions. This is particularly salient in multi-hop reasoning and multi-aspect questions; effective retrieval thus requires setwise optimization strategies that maximize joint generation performance [jain2025modeling].
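One simple way to picture setwise optimization is greedy selection by marginal utility, where each candidate is scored against the evidence already chosen. The sketch below is an assumption-laden illustration rather than the method of the cited work: `set_utility` stands in for any measure of joint generation performance, and the toy version here just counts how many reasoning hops a document set covers.

```python
from typing import Callable, Dict, List, Sequence, Set


def greedy_select(
    candidates: Sequence[str],
    set_utility: Callable[[List[str]], float],
    k: int,
) -> List[str]:
    """Pick up to k documents, each time adding the one with the largest marginal gain."""
    selected: List[str] = []
    remaining = list(candidates)
    current = set_utility(selected)
    while remaining and len(selected) < k:
        gains = [(set_utility(selected + [doc]) - current, doc) for doc in remaining]
        best_gain, best_doc = max(gains, key=lambda pair: pair[0])
        if best_gain <= 0:
            break  # nothing left that improves the set
        selected.append(best_doc)
        remaining.remove(best_doc)
        current += best_gain
    return selected


# Toy two-hop question: which hop each (hypothetical) document supports.
HOPS: Dict[str, Set[str]] = {
    "d1": {"hop1"},
    "d1_dup": {"hop1"},  # redundant with d1
    "d2": {"hop2"},      # complementary to d1
}


def toy_set_utility(docs: List[str]) -> float:
    """Fraction of the two required hops covered by the document set."""
    covered = set().union(*(HOPS[d] for d in docs))
    return len(covered) / 2


chosen = greedy_select(["d1", "d1_dup", "d2"], toy_set_utility, k=2)
```

Scored in isolation, all three documents look equally useful, so a per-document ranking could return `d1` and `d1_dup`. The setwise view assigns the duplicate zero marginal gain once `d1` is present and selects `d2` instead.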
A further development reframes retrieval as an adaptive process that satisfies the LLM's latent knowledge needs as they arise during generation. Techniques include uncertainty-based query generation, iterative retrieval loops, and attention-guided expansion. In agentic RAG, the system autonomously refines retrieval queries and reasoning strategies through reinforcement learning, with final answer quality providing the supervisory signal [jin2025search, gao2024smartrag, zheng2025deepresearcher].
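An iterative retrieval loop with an uncertainty trigger might look roughly like the following. Everything here is illustrative: the `retrieve` and `generate` callables, the confidence score returned by the generator, the threshold, and the query-expansion rule (append the intermediate answer to the question) are assumptions for the sketch and are not taken from the cited systems.

```python
from typing import Callable, Dict, List, Tuple


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], Tuple[str, float]],
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Retrieve more evidence while the generator reports low confidence."""
    evidence: List[str] = []
    query = question
    answer, confidence = generate(question, evidence)
    rounds = 0
    while confidence < threshold and rounds < max_rounds:
        new_docs = [d for d in retrieve(query) if d not in evidence]
        if not new_docs:
            break  # retrieval has nothing new to offer
        evidence.extend(new_docs)
        answer, confidence = generate(question, evidence)
        query = f"{question} {answer}"  # expand the query with the intermediate answer
        rounds += 1
    return answer, evidence


# Toy corpus and components (hypothetical stand-ins for a retriever and an LLM).
CORPUS: Dict[str, str] = {
    "inception": "Inception was directed by Christopher Nolan.",
    "nolan": "Christopher Nolan was born in London.",
}


def toy_retrieve(query: str) -> List[str]:
    return [text for key, text in CORPUS.items() if key in query.lower()]


def toy_generate(question: str, evidence: List[str]) -> Tuple[str, float]:
    text = " ".join(evidence)
    if "born in London" in text:
        return "London", 0.9
    if "directed by Christopher Nolan" in text:
        return "Christopher Nolan", 0.4
    return "unknown", 0.1


final_answer, used_evidence = iterative_rag(
    "Where was the director of Inception born?", toy_retrieve, toy_generate
)
```

In this toy run the first round surfaces the director, the low-confidence intermediate answer is folded into the next query, and the second round retrieves the document that closes the remaining hop.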
Despite progress, agentic RAG approaches often fix the underlying retrieval method (e.g., BM25 or dense retrievers) and tune query formulation or evidence synthesis, thus missing an opportunity to jointly optimize the retriever for utility-aware metrics. Unifying these perspectives necessitates research in co-evolving retrieval and generation objectives based on LLM-centric utility.
Numerical and Empirical Perspectives
Recent work demonstrates that optimizing for LLM-centric utility, both agnostic and specific, improves downstream answer accuracy, reasoning step coverage, and factual consistency in RAG pipelines compared to standard relevance-optimized retrievers [zhang2024large, zhang2025llm, ke2024bridging]. For example, context-dependent utility modeling yields significant gains in multi-hop QA, where classical retrievers fail to assemble non-redundant supporting evidence sets.
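The paper asserts that topical relevance is no longer sufficient as an optimization target in RAG systems and that utility-centric objectives, particularly those reflecting LLM-specific task outcomes, must replace or augment historical relevance-based metrics. This challenges the long-held Probability Ranking Principle that underpins most retrieval architectures: the principle ranks documents independently by their estimated probability of relevance, whereas context-dependent utility makes the value of a document depend on the other evidence retrieved with it.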
Practical and Theoretical Implications
The shift toward utility-centric retrieval carries several implications:
- Evaluation: Traditional IR test collections and relevance judgments are misaligned with RAG objectives. New benchmarks should focus on downstream answer quality or end-task satisfaction as the target metric, with utility labels inferred from LLM performance differentials.
- Training: Supervised retriever learning should incorporate LLM-centric utility signals, either through direct supervision (LLM-in-the-loop) or preference-based reward modeling.
- System Design: RAG workflows must support joint optimization of retriever and generator components; static retrieval methods are insufficient for adaptive, model-specific, or context-dependent needs.
- Research Directions: Open problems include scalable utility annotation frameworks, inter-model utility generalization, set-based utility optimization, and reinforcement learning-based agentic retrieval.
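For the training implication above, one plausible way to feed LLM-centric utility signals into a retriever is a distillation-style listwise loss that pulls the retriever's score distribution toward the distribution implied by utility labels. The paper summary does not specify a loss, so the KL objective, the temperature parameter, and all function names below are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def utility_distillation_loss(
    retriever_scores: Sequence[float],
    utilities: Sequence[float],
    utility_temperature: float = 1.0,
) -> float:
    """KL(target || predicted) over the candidate list for one query.

    target    = softmax of utility labels (e.g. LLM performance differentials)
    predicted = softmax of the retriever's scores for the same candidates
    """
    target = softmax(utilities, utility_temperature)
    predicted = softmax(retriever_scores)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Three candidates; only the first one actually helped the (hypothetical) LLM.
utilities = [1.0, 0.0, 0.0]
aligned_loss = utility_distillation_loss([2.0, 0.0, 0.0], utilities)
misaligned_loss = utility_distillation_loss([0.0, 2.0, 0.0], utilities)
```

A retriever that ranks the useful document first incurs a lower loss than one that favors an unhelpful document, so gradient-based training on this quantity would push scores toward utility rather than topical match. In practice the scores would come from a differentiable model; plain floats are used here only to keep the sketch self-contained.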
The theoretical ramifications intersect with the information-theoretic view of utility, session-level search economics, and a reappraisal of IR’s foundational constructs in the context of automated reasoning agents.
Conclusion
The paper articulates a unified, utility-centric framework for retrieval in the age of LLMs. By delineating the axes of LLM-agnostic versus LLM-specific utility and context-independent versus context-dependent estimation, this work exposes the limitations of relevance-centric optimization and grounds a reorientation of retrieval science for RAG. The proposed agenda is both conceptual and practical, calling for new evaluation, annotation, and end-to-end optimization practices that align retrieval objectives with the requirements of LLM-driven information access. Future developments are expected to yield retrievers that are dynamically adaptable, generation-aware, and optimized for maximal utility as defined by LLM-consumed downstream metrics (2604.08920).