Agentic Recommender Systems

Updated 4 July 2026

AgenticRS is a novel recommender paradigm that reframes recommendation as an autonomous, closed-loop process where agents perceive, decide, act, and learn.
The architecture is structured into decision, evolution, and infrastructure layers, enabling dynamic candidate generation, feedback integration, and adaptive personalization.
Key features include robust memory management, tool integration, and interactive feedback, enhancing scalability and continuous system optimization while addressing evaluation challenges.

Agentic Recommender System (AgenticRS) denotes a recommender paradigm in which recommendation is reorganized from a fixed pipeline or a monolithic “One Model” into a system of explicit agents that can perceive, decide, execute, receive feedback, and evolve over time (Hu et al., 27 Mar 2026). In this view, the recommender is no longer merely a passive scorer or generator; it becomes an active decision-maker that can maintain state, reason over goals, plan actions, invoke tools, interact with users over multiple turns, and revise its behavior from feedback (Huang et al., 23 Apr 2025). A central point in the literature is that AgenticRS does not mean turning every module into an agent: only modules that satisfy a functional closed loop, independent evaluability, and an evolvable decision space are promoted to agents (Hu et al., 27 Mar 2026).

1. Conceptual foundations and formal definitions

The conceptual break introduced by AgenticRS is a reframing of recommendation as a closed-loop decision process rather than a static scoring stack. One perspective defines an Agentic Recommender System as a system in which agents autonomously generate personalized recommendations by interacting with users and adapting to preferences over time, formalized as the tuple $(U, I, A, E, R)$, where $U$ is the set of users, $I$ is the set of items, $A$ is the set of agents, $E$ is the set of environmental contexts, and $R: U \times E \times A \rightarrow P(I)$ is the recommendation function (Huang et al., 20 Mar 2025). In the same formulation, each agent operates by perceiving a state $s = f(u, e)$, making decisions according to a policy $\pi_a(s)$, and learning from feedback to maximize expected user utility (Huang et al., 20 Mar 2025).
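The perceive, decide, and learn cycle in this formulation can be illustrated with a minimal Python sketch. Everything named here (`GreedyAgent`, the running-average update, the learning rate) is a hypothetical illustration chosen for brevity, not the cited paper's method; the state function $f(u, e)$ is assumed to be the plain (user, context) pair.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

State = Tuple[str, str]  # s = f(u, e): assumed here to be the (user, context) pair


@dataclass
class GreedyAgent:
    """Toy agent: perceives a state, picks top-k items, learns from feedback."""
    items: List[str]
    # Running utility estimate per (state, item); an assumption of this sketch.
    value: Dict[Tuple[State, str], float] = field(default_factory=dict)
    lr: float = 0.5

    def perceive(self, user: str, context: str) -> State:
        return (user, context)

    def policy(self, state: State, k: int = 2) -> List[str]:
        # pi_a(s): rank items by estimated utility; ties keep catalog order.
        ranked = sorted(self.items, key=lambda i: -self.value.get((state, i), 0.0))
        return ranked[:k]

    def learn(self, state: State, item: str, reward: float) -> None:
        # Move the estimate toward the observed reward.
        old = self.value.get((state, item), 0.0)
        self.value[(state, item)] = old + self.lr * (reward - old)


def recommend(agent: GreedyAgent, user: str, context: str, k: int = 2) -> List[str]:
    """R(u, e, a): a subset of the item set chosen by agent a for user u in context e."""
    return agent.policy(agent.perceive(user, context), k)
```

Calling `recommend`, observing a reward, and then calling `learn` closes the loop: the next recommendation for the same state reflects the feedback.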

A complementary formalism models an individual LLM agent as

$A_{\mathrm{LLM}} = \bigl(\mathcal{M}, \mathcal{I}, \mathcal{O}, \mathcal{F}, \Omega \bigr),$

where $\mathcal{M}$ is the language core, $\mathcal{I}$ the input space, $\mathcal{O}$ the output space, $\mathcal{F}$ the available tools or functions, and $\Omega$ the memory or state (Maragheh et al., 2 Jul 2025). The same work models a multi-agent recommender as

$\mathrm{MAS} = \bigl(\mathcal{A}, \mathcal{E}, \mathcal{C}\bigr),$

where $\mathcal{A}$ is the agent set, $\mathcal{E}$ the shared environment, and $\mathcal{C}$ the communication protocol defined by a communication matrix and admissible message schemas (Maragheh et al., 2 Jul 2025). This makes agentic recommendation explicitly relational: the recommender is not only a policy over items, but also an arrangement of communicating agents, shared state, and tool-mediated interaction.
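As a rough illustration of these two formalisms, the sketch below represents an agent by its language core, tools, and memory, and a multi-agent system by its agents, shared environment, and a communication matrix. The class names, the `send` method, and the encoding of the matrix as a sender-to-receivers mapping are assumptions made for this example, not definitions from the cited work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class LLMAgent:
    name: str
    core: Callable[[str], str]  # language core: maps an input to an output
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)  # agent-local state


@dataclass
class MultiAgentSystem:
    agents: Dict[str, LLMAgent]
    environment: Dict[str, object]  # shared state visible to all agents
    comm: Dict[str, Set[str]]       # sender -> receivers it may message

    def send(self, sender: str, receiver: str, message: str) -> bool:
        """Deliver a message only if the communication matrix allows it."""
        if receiver not in self.comm.get(sender, set()):
            return False
        self.agents[receiver].memory.append(f"{sender}: {message}")
        return True
```

Restricting delivery to the allowed edges is one simple way to keep the protocol explicit; message schemas could be validated in the same method.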

A recurring misconception addressed in the literature is that any conversational recommender or any LLM-enhanced ranker is already agentic. The agentic literature distinguishes such systems from simple prompting, fixed RAG, or prompt chaining by requiring an internal observe–decide–act loop, memory/state management, and some degree of autonomous continuation until the objective is reached (Maragheh et al., 2 Jul 2025). Another misconception is that “agentization” is universal; the foundational blueprint states that modules should be promoted to agents only when they participate in a complete operational loop, can be independently evaluated through stable interfaces and local metrics, and possess a meaningful decision space such as architecture, routing, scoring policy, or hyperparameters (Hu et al., 27 Mar 2026).

2. Architectural organization and closed-loop operation

The most explicit system-level blueprint organizes AgenticRS as a graph of interacting agents structured into three layers: a decision layer, an evolution layer, and an infrastructure layer (Hu et al., 27 Mar 2026). The decision layer directly serves recommendation requests and contains agents corresponding to candidate generation or recall, ranking, re-ranking or policy control, and routing or traffic orchestration. These agents output candidate sets, ranked lists, policy adjustments, or user-segment-specific routing decisions (Hu et al., 27 Mar 2026). The evolution layer hosts agents that improve the decision-layer agents by analyzing logs, rewards, failures, and experiment outcomes, then proposing better architectures, hyperparameters, routing rules, or training schemes. The infrastructure layer provides shared state and memory, user and item profiles, interaction histories, global constraints, meta-knowledge from past experiments, and orchestration and task scheduling (Hu et al., 27 Mar 2026).

The closed-loop abstraction underlying this architecture is stated as perception, decision, execution, and feedback. A module qualifies for agentization when it participates in that loop, has clear input/output interfaces, and can be improved without destabilizing the whole system (Hu et al., 27 Mar 2026). This criterion is important because it separates agentic recommender design from arbitrary modularization. A pipeline can be modular and still be static; AgenticRS requires modules with explicit responsibilities, direct outcome links, and autonomous update potential.
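The promotion criteria can be read as a simple conjunction. The following sketch encodes them as a predicate; the `ModuleProfile` fields and the use of a positive decision-space size as a stand-in for "evolvable" are hypothetical simplifications introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleProfile:
    name: str
    closes_loop: bool              # perceives, decides, executes, gets feedback
    independently_evaluable: bool  # stable interface plus a local metric
    decision_space_size: int       # tunable choices; 0 means nothing to evolve


def should_promote(module: ModuleProfile) -> bool:
    """A module becomes an agent only if all three criteria hold."""
    return (module.closes_loop
            and module.independently_evaluable
            and module.decision_space_size > 0)
```

Under this reading, a ranking stage with its own metric and tunable scoring policy qualifies, whereas a fixed feature-logging step does not, even though both are modules of the same pipeline.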

Industrial system design papers extend this blueprint into full-lifecycle automation. “AgenticRS-Architecture: System Design for Agentic Recommender Systems” describes AutoModel as an agent-based architecture for the full lifecycle of industrial recommender systems, with three core evolution agents: AutoTrain for model design and training, AutoFeature for data analysis and feature evolution, and AutoPerf for performance, deployment, and online experimentation (Zhang et al., 27 Mar 2026). A shared coordination layer decomposes goals into cross-agent workflows, while a shared knowledge layer stores problem definitions, feature and model configurations, training logs, evaluation results, online experiment conclusions, reward signals, and credit assignment records (Zhang et al., 27 Mar 2026). This suggests a broader interpretation of AgenticRS: the term can denote not only online recommendation-time agents, but also agentic control of the entire recommender lifecycle.

The same architectural logic appears in optimization-oriented systems. EvoRec is organized around an Orchestrator Agent, Research Agent, Code Agent, and Skill Evolver, with durable components consisting of the Model, the Skill library, and Memory (Mu et al., 15 Jun 2026). AgenticRecTune uses five specialized agents—Actor, Critic, Insight, Skill, and Online—to optimize system-level configurations across pre-ranking, ranking, and re-ranking, with a self-evolving Skillhub that summarizes historical outcomes into reusable optimization knowledge (Wu et al., 21 Apr 2026). In both cases, recommendation quality is treated as the outcome of an agentic ecosystem rather than as the output of a single frozen model.

3. Memory, tool use, planning, and interaction

Memory is one of the most developed differentiators of AgenticRS. Survey literature describes memory as the mechanism by which an agent stores user history, prior interactions, and possibly past actions, supporting persistent personalization, coherent multi-turn conversations, and stateful decision-making (Huang et al., 23 Apr 2025). A more formal memory model decomposes agent memory into working or short-term memory, episodic memory, semantic memory, and procedural memory, with explicit update and retrieval operators (Maragheh et al., 2 Jul 2025). This gives AgenticRS a state model substantially richer than a conventional user embedding.

A concrete realization is the hierarchical belief-state memory framework described in “Agentic Recommender System with Hierarchical Belief-State Memory” (Shen et al., 14 May 2026). That paper treats recommendation as a partially observable problem and maintains a structured belief state

$b_t = \bigl(M^{\mathrm{event}}_t, M^{\mathrm{pref}}_t, M^{\mathrm{profile}}_t\bigr),$

where event memory $M^{\mathrm{event}}_t$ stores raw observations, preference memory $M^{\mathrm{pref}}_t$ stores fine-grained mutable preference chunks with explicit strength and evidence tracking, and profile memory $M^{\mathrm{profile}}_t$ distills preferences into a coherent natural-language narrative (Shen et al., 14 May 2026). The system defines six memory operations (extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis), scheduled by an LLM-based planner rather than fixed heuristics (Shen et al., 14 May 2026). The significance is not merely the presence of memory, but the explicit lifecycle governing how beliefs are extracted, revised, compressed, and removed.
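A minimal sketch of the three memory tiers and the six operations follows. It is an assumption-laden toy: the numeric strength updates, the forgetting threshold, and the string-based profile are invented for illustration, and the operations are called by hand here, whereas the paper schedules them with an LLM-based planner.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Preference:
    strength: float
    evidence: List[str] = field(default_factory=list)


@dataclass
class BeliefState:
    events: List[str] = field(default_factory=list)                   # raw observations
    preferences: Dict[str, Preference] = field(default_factory=dict)  # mutable chunks
    profile: str = ""                                                 # narrative summary

    def extract(self, event: str, topic: str) -> None:
        """Extraction: log the event and open a preference chunk if needed."""
        self.events.append(event)
        self.preferences.setdefault(topic, Preference(strength=0.0))

    def reinforce(self, topic: str, event: str, delta: float = 0.3) -> None:
        pref = self.preferences[topic]
        pref.strength = min(1.0, pref.strength + delta)
        pref.evidence.append(event)

    def weaken(self, topic: str, delta: float = 0.3) -> None:
        pref = self.preferences[topic]
        pref.strength = max(0.0, pref.strength - delta)

    def consolidate(self, a: str, b: str, merged: str) -> None:
        """Consolidation: merge two chunks, keeping the stronger strength."""
        pa, pb = self.preferences.pop(a), self.preferences.pop(b)
        self.preferences[merged] = Preference(max(pa.strength, pb.strength),
                                              pa.evidence + pb.evidence)

    def forget(self, threshold: float = 0.1) -> None:
        """Forgetting: drop chunks whose strength fell below the threshold."""
        self.preferences = {t: p for t, p in self.preferences.items()
                            if p.strength >= threshold}

    def resynthesize(self) -> None:
        """Resynthesis: rebuild the profile from the surviving chunks."""
        ranked = sorted(self.preferences.items(), key=lambda kv: -kv[1].strength)
        self.profile = "Likes: " + ", ".join(topic for topic, _ in ranked)
```

Keeping evidence alongside each strength value is what allows a later operation to justify weakening or forgetting a belief rather than silently overwriting it.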

MemRec advances a different memory line by moving from isolated memory to collaborative memory. It models memory as a graph

$\mathcal{G} = (\mathcal{V}, \mathcal{E}),$

where each user or item node in $\mathcal{V}$ stores semantic memory and a dedicated $\mathrm{LM}_{\mathrm{Mem}}$ manages collaborative retrieval, synthesis, and asynchronous propagation, while a downstream $\mathrm{LLM}_{\mathrm{Rec}}$ performs grounded reasoning and ranking (Chen et al., 13 Jan 2026). The framework’s key architectural decision is to decouple memory management from recommendation reasoning so that graph-level collaborative signals can be used without overloading the reasoning LLM (Chen et al., 13 Jan 2026).

Tool use is likewise central. ChainRec models recommendation as a finite-horizon MDP

$(\mathcal{S}, \mathcal{A}, P, R, H),$

with state space $\mathcal{S}$, transition function $P$, reward $R$, horizon $H$, and actions $\mathcal{A}$ consisting of standardized reasoning tools and a terminal ranking action (Li et al., 11 Feb 2026). A learned planner dynamically selects which tool to call, in what order, and when to stop, using a Tool Agent Library mined from expert trajectories (Li et al., 11 Feb 2026). The paper’s emphasis is that recommendation scenarios differ substantially (cold-start, evolving-interest, and sparse item-side contexts require different evidence-gathering chains) and therefore the reasoning workflow should not be fixed (Li et al., 11 Feb 2026). This is a strong example of planning as action selection over tool chains rather than as free-form chain-of-thought.
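The control flow described here, where a planner picks tools one at a time and decides when to stop, can be sketched as a bounded loop. This is only a schematic: the paper's planner is learned, while `rule_planner` below is a hand-written stand-in, and the tool names, the `"RANK"` sentinel, and the score dictionaries are all hypothetical.

```python
from typing import Callable, Dict, List

Tool = Callable[[dict], dict]  # a tool reads the state and returns new evidence


def run_episode(state: dict, tools: Dict[str, Tool],
                planner: Callable[[dict, List[str]], str],
                rank: Callable[[dict], List[str]], horizon: int = 4) -> List[str]:
    """Call planner-chosen tools until it emits 'RANK' or the horizon ends."""
    trace: List[str] = []
    for _ in range(horizon):
        action = planner(state, trace)
        if action == "RANK":  # terminal ranking action
            break
        state = {**state, **tools[action](state)}
        trace.append(action)
    return rank(state)


def rule_planner(state: dict, trace: List[str]) -> str:
    # Stand-in for a learned planner: one evidence-gathering step, then rank.
    if not trace:
        return "item_meta" if state["cold_start"] else "user_history"
    return "RANK"


TOOLS: Dict[str, Tool] = {
    "user_history": lambda s: {"scores": {"i1": 0.9, "i2": 0.2}},
    "item_meta": lambda s: {"scores": {"i1": 0.1, "i2": 0.7}},
}


def rank_by_score(state: dict) -> List[str]:
    scores = state.get("scores", {})
    return sorted(scores, key=scores.get, reverse=True)
```

The same loop yields different tool chains for a cold-start user and a user with history, which is the point of making the workflow a decision rather than a fixed script.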

Interactive feedback further differentiates AgenticRS from static recommenders. RecoWorld defines an agentic recommender as an autonomous recommendation agent that can perceive user state and context, reason and plan over user instructions, use tools or actions to update recommendation lists, and maintain memory of past interactions (Liu et al., 12 Sep 2025). Its dual-view architecture places a simulated user and an agentic recommender in a multi-turn environment in which the user simulator may click, watch, skip, or leave, and when near disengagement generates reflective instructions such as “show me more interesting content” (Liu et al., 12 Sep 2025). This converts recommendation from one-shot prediction into a trajectory-level interaction process.

4. Representative system families and application domains

A substantial portion of the literature consists of concrete AgenticRS instantiations specialized for distinct recommendation problems. In multi-agent collaborative recommendation, MACF reinterprets classical collaborative filtering as LLM-agent collaboration by instantiating similar users as user agents and query-relevant historical items as item agents, coordinated by a central orchestrator that performs dynamic agent recruitment and personalized collaboration instruction (Xia et al., 23 Nov 2025). The agents draw on four retrieval tools, and the system empirically achieves the best overall performance across Amazon Clothing, Amazon Beauty, and Amazon Music under Hit Rate@10 (H@10) and NDCG@10 (N@10) (Xia et al., 23 Nov 2025). The significance of MACF is that it preserves the inductive bias of collaborative filtering while replacing static aggregation with interactive, query-aware collaboration.

In multimodal and cold-start recommendation, MARC implements Agentic Retrieval-Augmented Generation for cocktail recommendation using a task recognition router and a reflection process over a graph database built from Kaggle cocktail data (Cho et al., 11 Nov 2025). The router classifies queries into four tasks—Color-Ingredient Visual Search, Glass Type with Ingredient Matching, Multi-hop Ingredient Expansion, and Cocktail Similarity and Alternative—while the reflection module scores retrieved candidate sets on relevance, diversity, completeness, and coherence, triggering retries when the overall score is below 80 (Cho et al., 11 Nov 2025). The paper reports that graph RAG outperformed vector-only RAG in both LLM-as-a-judge and human evaluation (Cho et al., 11 Nov 2025). This supports a domain-specific form of agentic recommendation in which routing, retrieval control, and self-critique are first-class components.
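The reflection-and-retry behaviour can be sketched as a loop around retrieval. The threshold of 80 and the four criteria come from the description above; how the overall score is aggregated is not stated here, so the unweighted mean used below is an assumption, as are the function names, the retry cap, and the fallback to the best attempt.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

CRITERIA = ("relevance", "diversity", "completeness", "coherence")


def retrieve_with_reflection(
        query: str,
        retrieve: Callable[[str, int], List[str]],
        judge: Callable[[str, List[str]], Dict[str, float]],
        threshold: float = 80.0,
        max_attempts: int = 3) -> Tuple[List[str], float, int]:
    """Retry retrieval until the reflection score reaches the threshold."""
    best_candidates: List[str] = []
    best_score = float("-inf")
    for attempt in range(1, max_attempts + 1):
        candidates = retrieve(query, attempt)
        scores = judge(query, candidates)
        overall = mean(scores[c] for c in CRITERIA)  # assumed aggregation
        if overall > best_score:
            best_candidates, best_score = candidates, overall
        if overall >= threshold:
            return candidates, overall, attempt
    # No attempt passed: fall back to the best-scoring candidate set.
    return best_candidates, best_score, max_attempts
```

Passing the attempt number to `retrieve` lets a caller vary the retrieval strategy on each retry, for example by widening a graph traversal.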

Scientific dataset recommendation supplies another domain-specific instantiation. ScienceDB AI is an LLM-driven agentic recommender for scientific data sharing services, with three main modules: an Experimental Intention Perceptor, a Structured Memory Compressor, and a Trustworthy Retrieval-Augmented Generation framework (Long et al., 3 Jan 2026). It formalizes multi-turn scientific recommendation over a query sequence $Q = (q_1, \dots, q_T)$ and a candidate dataset corpus $\mathcal{D}$ with more than 10 million datasets, and uses a parsed intent representation for retrieval and state update (Long et al., 3 Jan 2026). Its trustworthiness mechanism requires citable CSTR identifiers in the final response, reducing freedom to hallucinate nonexistent datasets (Long et al., 3 Jan 2026). This illustrates how agentic recommendation can be adapted to high-stakes, domain-constrained retrieval settings.

At a finer granularity, AgentRec frames agent recommendation itself as a recommender problem. It routes a natural-language prompt to the most appropriate LLM agent by encoding prompts with a fine-tuned Sentence-BERT-style encoder, comparing the query embedding against per-agent embedding corpora with cosine similarity, and selecting the agent with the highest aggregate score (Park et al., 23 Jan 2025). The model reports 92.2% top-1 test accuracy and less than 300 milliseconds per prompt when embeddings are cached (Park et al., 23 Jan 2025). Although the recommendation target is an agent rather than an item, the system functions as an agentic router within larger multi-agent architectures.
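The routing step can be sketched with plain vectors standing in for sentence embeddings. The aggregation rule is not specified in the description above, so the mean cosine similarity used here is an assumption; the function names and the toy two-dimensional vectors are likewise illustrative only, and every agent corpus is assumed to be non-empty.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query: Vector, corpora: Dict[str, List[Vector]]) -> str:
    """Pick the agent whose (non-empty) corpus is most similar on average."""
    def score(agent: str) -> float:
        vectors = corpora[agent]
        return sum(cosine(query, v) for v in vectors) / len(vectors)
    return max(corpora, key=score)
```

In a real deployment the corpus embeddings would be computed once and cached, which is consistent with the latency figure being reported for cached embeddings.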

Self-evolving preference propagation systems extend AgenticRS beyond session-level personalization. RecNet introduces user agents, item agents, and router agents that proactively propagate real-time preference updates across related users and items, then optimize the propagation strategy through a feedback-driven mechanism framed as a multi-agent reinforcement learning analogue with textual credit assignment (Li et al., 29 Jan 2026). Its forward phase routes preference updates through router agents, while its backward phase uses LLMs for credit assignment, gradient analysis, and module-level optimization (Li et al., 29 Jan 2026). This suggests an interpretation of recommendation as an evolving communication network rather than a direct map from user history to ranked items.

5. Optimization, reward modeling, simulation, and evaluation

AgenticRS has driven a corresponding shift in training and evaluation. RecoWorld is described as a blueprint for simulated environments tailored to agentic recommender systems, explicitly motivated by the need for a safe training space where agents can learn from errors without impacting real users (Liu et al., 12 Sep 2025). It models recommender–user interaction as a Markov Decision Process with user mindset state $s_t$, action policy $\pi(a_t \mid s_t)$, transition

$s_{t+1} \sim P(\cdot \mid s_t, a_t),$

and reward function

$r_t = R(s_t, a_t),$

where rewards can combine total session time, clicks, and self-critique scores (Liu et al., 12 Sep 2025). The explicit optimization target is long-term retention and engagement rather than only immediate relevance.
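A toy rollout makes the trajectory-level view concrete. The sketch below is not RecoWorld's simulator: the state is reduced to a hypothetical "remaining patience" counter, the user model is a deterministic stub, the reward weights are arbitrary, and reflective instructions and self-critique scores are omitted.

```python
from typing import Callable, List, Tuple


def toy_user(state: int, item: str) -> Tuple[str, float]:
    """Deterministic stand-in for a user simulator: (event, seconds spent)."""
    if state <= 0:
        return "leave", 0.0
    return ("click", 30.0) if item == "good" else ("skip", 0.0)


def rollout(policy: Callable[[int], str],
            user: Callable[[int, str], Tuple[str, float]],
            patience: int = 2, max_turns: int = 10,
            w_time: float = 1.0, w_click: float = 5.0) -> Tuple[float, List[str]]:
    """Run one session and return (cumulative reward, event log)."""
    state, total, log = patience, 0.0, []
    for _ in range(max_turns):
        item = policy(state)
        event, seconds = user(state, item)
        log.append(event)
        if event == "leave":
            break
        # Reward combines time spent and clicks (weights are assumptions).
        total += w_time * seconds + (w_click if event == "click" else 0.0)
        # A skip erodes patience; any other event restores it.
        state = state - 1 if event == "skip" else patience
    return total, log
```

Because the return is accumulated over the whole session, a policy that causes early disengagement scores poorly even if each individual item looked plausible.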

Reward modeling has likewise become multidimensional. RecRM-Bench is introduced as the first comprehensive benchmark specifically engineered for reward modeling in agentic recommender systems and contains 1,073,779 high-quality samples across four sub-databases: Instruction Following, Factual Consistency, Query-Item Relevance, and User Behavior Prediction (Zeng et al., 12 May 2026). On top of this benchmark, RecRM-RL trains separate reward models and integrates them with a hierarchical reward function that prunes invalid outputs early, then uses relevance as a gate before ranking and behavior rewards are activated (Zeng et al., 12 May 2026). The paper reports that adding reward components incrementally improves final behavior prediction by 19% overall, with a 7.8% gain after adding relevance rewards (Zeng et al., 12 May 2026). This makes explicit the claim that single-dimensional terminal rewards are insufficient for agentic recommendation.
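The gating logic described here can be sketched as a short cascade. Only the ordering (validity check first, relevance gate second, remaining rewards last) follows the description; the penalty value, the gate threshold, and the additive combination are assumptions introduced for this example and are not taken from the paper.

```python
def hierarchical_reward(follows_instruction: bool, factual: bool,
                        relevance: float, ranking: float, behavior: float,
                        relevance_gate: float = 0.5,
                        invalid_penalty: float = -1.0) -> float:
    """Gated reward: validity first, then relevance, then the rest."""
    if not (follows_instruction and factual):
        return invalid_penalty  # prune invalid outputs early
    if relevance < relevance_gate:
        return relevance        # gate closed: relevance signal only
    return relevance + ranking + behavior  # gate open: all rewards active
```

One consequence of this structure is that a high behavior score cannot compensate for an output that ignores the instruction or fabricates facts, since those outputs never reach the later terms.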

Benchmarking has also become interactive and scenario-based. AgentRecBench provides an interactive textual recommendation simulator with rich user, item, and review metadata across Yelp, Goodreads, and Amazon, and evaluates classic, evolving-interest, and cold-start recommendation tasks (Shang et al., 26 May 2025). It defines actions over recommendation and information-seeking, compares 10 classical and agentic methods, and reports that the strongest systems are agentic rather than classical (Shang et al., 26 May 2025). The benchmark was validated through the AgentSociety Challenge, which attracted 295 competing teams worldwide and over 1,400 submissions during a 37-day competition (Shang et al., 26 May 2025). This suggests that AgenticRS evaluation requires environments that expose metadata, tool access, and scenario shifts not captured by conventional offline ranking benchmarks.

A more stringent evaluation move appears in a benchmark that replaces LLM-as-a-judge with verifiable rewards and reveal-tagged elicitation over structured catalog predicates (Narasimhan et al., 8 Jun 2026). The primary reward is checked against those predicates rather than assigned by a judge model, and the benchmark reports pass$^k$ reliability rather than only single-run capability (Narasimhan et al., 8 Jun 2026). Its main empirical finding is a steep reliability cliff: even the strongest model's pass$^k$ score drops sharply as the number of repeated trials $k$ grows (Narasimhan et al., 8 Jun 2026). The benchmark also formalizes policy dimensions such as recommend_tool, watch_history, availability, age_restricted, sponsored, transparency, and single_recommendation (Narasimhan et al., 8 Jun 2026). In evaluation terms, this redefines success from plausible recommendation output to consistent task completion under hidden constraints and policy rules.
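Assuming the reliability metric is pass^k in the sense commonly used for agent benchmarks (the probability that k independent trials of the same task all succeed), a per-task estimate can be computed from n recorded trials with c successes as C(c, k) / C(n, k). The function name below is hypothetical, and whether the benchmark uses exactly this estimator is an assumption.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from the recorded ones all pass."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # comb(successes, k) is 0 when k > successes, so the estimate is 0.0 then.
    return comb(successes, k) / comb(trials, k)
```

For a task solved in 6 of 8 trials, the estimate is 0.75 at k = 1 but only 15/70 (about 0.21) at k = 4, which illustrates how a respectable single-run success rate can hide low repeated reliability.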

6. Limitations, controversies, and open research directions

The literature repeatedly describes AgenticRS as promising but operationally difficult. The foundational blueprint is explicitly conceptual: it introduces no specific algorithm and provides no full end-to-end implementation, empirical benchmark, explicit mathematical training objective, or demonstrated production deployment (Hu et al., 27 Mar 2026). This limitation matters because the term “AgenticRS” covers a design space broader than any single implemented system.

Several cross-cutting technical difficulties recur. Survey work identifies scalability issues, real-time inference cost, difficulty evaluating agent effectiveness, unpredictability of autonomous decisions, dependence on memory quality, and tool brittleness (Huang et al., 23 Apr 2025). The multi-agent perspective formalizes additional challenge families: protocol complexity, scalability, hallucination and error propagation, emergent misalignment including covert collusion, and brand compliance (Maragheh et al., 2 Jul 2025). In that formalization, communication among many agents can become a bottleneck, invalid messages can propagate through shared memory and downstream agents, and individually rational agent behavior can produce system-level misalignment (Maragheh et al., 2 Jul 2025).

Evaluation findings reinforce these concerns. The verifiable-reward benchmark shows that single-run success can conceal low repeated reliability, especially when hidden constraints must be elicited through dialogue (Narasimhan et al., 8 Jun 2026). RecoWorld and related simulation work imply that recommendation must increasingly be treated as sequential decision-making over trajectories rather than as static next-item prediction (Liu et al., 12 Sep 2025). This suggests that conventional metrics such as HR@K or NDCG@K remain useful but are insufficient on their own. A plausible implication is that future benchmarks will need to combine verifiable task success, reliability across repeated trials, policy compliance, and longer-horizon reward modeling.

Open research directions are explicit. Perspective and survey papers call for multimodal agentic recommendation, retrieval-augmented agentic recommender systems, better planning and reflection, more suitable interactive and multimodal benchmarks, efficiency and scalability techniques, stronger safety and privacy mechanisms, and better user simulators and environment modeling (Huang et al., 20 Mar 2025). The multidimensional reward literature suggests that reliable AgenticRS will require explicit optimization of instruction following, factual consistency, query-item relevance, and user behavior rather than terminal engagement alone (Zeng et al., 12 May 2026). Across the cited work, the common trajectory is from manual optimization of static recommenders toward modular autonomy, explicit agent interfaces, layered rewards, tool-grounded interaction, and continuous adaptation under business and policy constraints (Hu et al., 27 Mar 2026).