RecoWorld: Simulated Recommender Framework
- RecoWorld is a simulated environment for agentic recommender systems enabling multi-turn, dialogic interactions between users and agents to maximize long-term engagement.
- It employs a dual-view architecture with simulated users and recommendation agents, incorporating reflective feedback and Markovian protocols to adapt strategies.
- The framework integrates text, multimodal, and semantic modeling, supporting advanced multi-agent simulations and safe benchmarking of reinforcement learning methods.
RecoWorld is a simulated environment framework for agentic recommender systems designed to facilitate the training, evaluation, and benchmarking of recommendation agents through rich, multi-turn, user-in-the-loop interactions. It emphasizes a dual-view structure, where a simulated user and a recommendation agent engage in dialogic, session-based exchanges that seek to maximize long-term user retention and engagement. RecoWorld is organized around several technical pillars, including a Markovian user–agent interaction protocol, instruction-following via explicit user feedback and reflective reasoning, content modeling at multiple levels (textual, multimodal, semantic), and reinforcement learning (RL) optimization over repeated sessions. The environment enables advanced experimentation by supporting multi-agent simulations and by providing a flexible API for integrating state-of-the-art agentic and user simulation modules (Liu et al., 12 Sep 2025).
1. Dual-View Architecture: User Simulator and Agentic Recommender
RecoWorld’s architecture is built around the continuous exchange between a simulated user and an agentic recommender. At each interaction round, the recommender generates a candidate list of items by leveraging components such as candidate retrieval, ranking, and re-ranking. The user simulator processes this list in sequence, making decisions for each item based on its current mindset and interaction context.
The user simulator performs three discrete steps per item:
- “Think it through”: The simulator reasons about the item in the context of prior interactions and session context.
- “Take action”: An action is selected from a discrete action space $\mathcal{A}$, with possible actions including Click, Comment, Share, Like, Watch (with duration), Skip, and Leave.
- “Update your mindset”: The user’s internal state is revised so that subsequent decisions reflect session evolution and changing engagement patterns.
Upon selecting Leave, the user simulator is prompted to synthesize its session experience by generating a reflective instruction—such as “I’d like more diverse content”—which is then fed to the agentic recommender. This initiates an adaptive update of the recommendation strategy for subsequent sessions (Liu et al., 12 Sep 2025).
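The per-item loop above can be made concrete with a short sketch. This is a hypothetical rendering rather than RecoWorld's actual API: the class `UserSimulator`, its methods (`think`, `act`, `update_mindset`, `reflect`), and the `run_session` helper are names assumed here for illustration, and the action choice is a placeholder where the paper's simulator would invoke an LLM.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Action(Enum):
    """Discrete action space described in the paper."""
    CLICK = auto()
    COMMENT = auto()
    SHARE = auto()
    LIKE = auto()
    WATCH = auto()   # would carry a duration in a fuller implementation
    SKIP = auto()
    LEAVE = auto()


@dataclass
class UserSimulator:
    """Hypothetical sketch of the three-step per-item loop: think, act, update mindset."""
    mindset: str                                  # natural-language internal state
    history: List[str] = field(default_factory=list)

    def think(self, item: str) -> str:
        # "Think it through": reason about the item given prior interactions.
        return f"Considering '{item}' given mindset: {self.mindset}"

    def act(self, thought: str) -> Action:
        # "Take action": the paper's simulator would have an LLM decide here;
        # this placeholder leaves after three items to exercise the reflect path.
        return Action.LEAVE if len(self.history) >= 3 else Action.CLICK

    def update_mindset(self, item: str, action: Action) -> None:
        # "Update your mindset": revise internal state so later decisions
        # reflect session evolution and changing engagement patterns.
        self.history.append(f"{action.name} on {item}")
        self.mindset = f"{self.mindset}; recently took {action.name} on {item}"

    def reflect(self) -> str:
        # Issued after LEAVE: a reflective instruction for the recommender.
        return "I'd like more diverse content."


def run_session(user: UserSimulator, slate: List[str]) -> Optional[str]:
    """Process a recommended slate item by item; return a reflective
    instruction if the user leaves, else None."""
    for item in slate:
        thought = user.think(item)
        action = user.act(thought)
        user.update_mindset(item, action)
        if action is Action.LEAVE:
            return user.reflect()
    return None


# Example: a five-item slate; the placeholder policy leaves after three items.
user = UserSimulator(mindset="curious about indie music")
print(run_session(user, [f"video_{k}" for k in range(5)]))
```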
2. Markovian Interaction, Reflective Instructions, and Feedback Loop
The interaction dynamics within RecoWorld are modeled as a Markov Decision Process (MDP), with the user’s mindset or internal state $s_t$ forming the environment’s state. The recommender, parameterized by a policy $\pi_\theta$, selects actions $a_t$ (lists or slates) in response to $s_t$, and the user’s state evolves according to a stochastic transition model $P(s_{t+1} \mid s_t, a_t)$.
Crucially, the user simulator issues explicit reflective instructions following disengagement. These instructions—represented as natural language or structured signals—provide direct, interpretable feedback that can be incorporated by the agent. This enables agentic recommenders to adjust strategies not only based on implicit behavioral traces (e.g., drops in session time or click rates) but also via explicit user guidance (e.g., requests for more diversity or novelty).
This instruction-following loop underpins a new paradigm: “user instructs, recommender responds,” in which the agent must synthesize both behavioral and reflective signals to optimize cumulative engagement (Liu et al., 12 Sep 2025).
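This loop can be sketched schematically as follows. The interfaces below (`Recommender`, `UserSim`, and their methods) are assumptions made for illustration, not RecoWorld's published API; the sketch only shows how implicit per-round rewards and the explicit post-session instruction both flow back to the agent.

```python
from typing import List, Optional, Protocol, Tuple


class Recommender(Protocol):
    # Assumed agent interface (illustrative only).
    def recommend(self, state: str, instruction: Optional[str]) -> List[str]: ...
    def observe(self, slate: List[str], rewards: List[float]) -> None: ...
    def adapt(self, instruction: str) -> None: ...


class UserSim(Protocol):
    # Assumed user-simulator interface (illustrative only).
    def reset(self) -> str: ...
    def step(self, slate: List[str]) -> Tuple[str, List[float], bool]: ...
    def reflective_instruction(self) -> str: ...


def run_multi_session(agent: Recommender, user: UserSim, num_sessions: int) -> None:
    """Schematic loop: implicit behavioral feedback within each session,
    explicit reflective instruction between sessions."""
    instruction: Optional[str] = None
    for _ in range(num_sessions):
        state = user.reset()                              # initial mindset s_0
        done = False
        while not done:
            slate = agent.recommend(state, instruction)   # a_t from pi_theta
            state, rewards, done = user.step(slate)       # s_{t+1} ~ P(. | s_t, a_t)
            agent.observe(slate, rewards)                 # implicit signals (clicks, watch time)
        instruction = user.reflective_instruction()       # explicit feedback at disengagement
        agent.adapt(instruction)                          # adjust next-session strategy
```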
3. Content and User Modeling: Text, Multimodal, and Semantic ID
RecoWorld is designed to support multiple content and user representation strategies:
- Text-Based Modeling: User profiles, item metadata, and histories are represented as natural language strings. This facilitates leveraging the reasoning and in-context learning capabilities of LLMs, which enables nuanced instruction-following and dynamic adaptation based on session context.
- Multimodal Modeling: Integration with vision-language models (VLMs) such as Qwen2.5 Omni or Gemini-2.5-Pro allows the simulator and agents to process visual (e.g., images), auditory (e.g., audio clips), and auxiliary metadata, enhancing realism and enabling a broader spectrum of recommendation scenarios.
- Semantic ID Modeling: Each item can be mapped to a compact “semantic ID” reflecting a dense, hierarchical encoding of item features. This approach supports efficient sequence modeling and allows for coarse-to-fine evaluation of similarity and diversity in recommendations.
Each representation alternative presents distinct trade-offs with respect to fidelity, generalization, and computational efficiency, allowing tailored experimentation for different research objectives (Liu et al., 12 Sep 2025).
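The three representation strategies can be contrasted with a small sketch. The `Item` fields and helper functions below are hypothetical; in particular, the hash-based `to_semantic_id` merely stands in for a learned quantizer (e.g., residual quantization over item embeddings) to show the coarse-to-fine tuple shape.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    item_id: int
    title: str
    category: str
    tags: List[str]


def to_text(item: Item) -> str:
    """Text-based view: a natural-language string an LLM can reason over."""
    return f"{item.title} (category: {item.category}; tags: {', '.join(item.tags)})"


def to_multimodal(item: Item, image_path: str, audio_path: str) -> dict:
    """Multimodal view: bundle textual metadata with pointers to visual and
    audio content that a VLM-backed simulator or agent could consume."""
    return {"text": to_text(item), "image": image_path, "audio": audio_path}


def to_semantic_id(item: Item, codebook_sizes: Tuple[int, ...] = (256, 256, 256)) -> Tuple[int, ...]:
    """Semantic-ID view: a coarse-to-fine tuple of codes. A real system would
    derive these from learned quantizers; a hash stands in purely for illustration."""
    h = abs(hash((item.title, item.category)))
    codes = []
    for size in codebook_sizes:
        codes.append(h % size)
        h //= size
    return tuple(codes)


# Example item rendered under each view.
item = Item(1, "Lo-fi study mix", "music", ["lo-fi", "instrumental"])
print(to_text(item))
print(to_multimodal(item, "thumb.jpg", "clip.mp3"))
print(to_semantic_id(item))
```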
4. Multi-Turn Reinforcement Learning Optimization
In RecoWorld, the agentic recommender’s task is formalized as maximizing a reward function over a sequence of interaction rounds. Multi-turn RL algorithms—such as Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO)—are employed to update policy parameters based on long-horizon feedback, which may include both immediate engagement signals (e.g., click, watch time) and latent objectives (e.g., session depth, retention).
For each candidate recommendation list, cumulative trajectory rewards are computed and the policy is optimized to maximize the expected return over a session, $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, where the per-round reward $r_t$ reflects either session time or user-defined engagement metrics. End-to-end RL training enables the recommender to discover complex policies that balance short-term relevance and long-term satisfaction. The presence of explicit instructions and user reasoning traces introduces a richer reward structure and supports complex behaviors such as curiosity-driven exploration, diversity-seeking, or resilience to user disengagement (Liu et al., 12 Sep 2025).
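A minimal sketch of the return computation follows. The reward shaping in `session_rewards` (watch time plus an optional bonus, with a penalty when the user leaves early) is an assumption introduced for illustration; the concrete reward definition is left to the experimenter.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Cumulative trajectory reward sum_t gamma^t * r_t, the quantity the
    multi-turn RL objective maximizes in expectation."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g


def session_rewards(watch_times: List[float], left_early: bool,
                    diversity_bonus: float = 0.0) -> List[float]:
    """Illustrative per-round reward: immediate engagement (watch time) plus an
    optional shaping bonus, with a penalty on the final step if the user
    disengaged early. This shaping is an assumption, not RecoWorld's definition."""
    rewards = [w + diversity_bonus for w in watch_times]
    if left_early and rewards:
        rewards[-1] -= 1.0
    return rewards


# Example: a four-round session in which the user disengaged early.
r = session_rewards([12.0, 8.5, 3.0, 0.0], left_early=True)
print(discounted_return(r, gamma=0.95))
```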
5. Multi-Agent Simulation and Population Dynamics
Beyond dyadic (single-user) interaction, RecoWorld extends to simulate populations of users. In this setting, each user’s latent state is updated through both individual experiences and interactions with neighboring users, formalized as $s_i^{(t+1)} = f\big(s_i^{(t)},\, a_i^{(t)},\, \{m_{ij}^{(t)}\}_{j \in \mathcal{N}(i)},\, E^{(t)}\big)$,
where $m_{ij}^{(t)}$ is the feedback exchanged between users $i$ and $j$, and $E^{(t)}$ is the evolving environment. Multi-agent simulation enables studying collective phenomena such as opinion dynamics, content virality, group-level satisfaction, and adversarial gaming of recommendation logic, making the framework suitable for research into social recommendation, algorithmic fairness, and creator–audience feedback loops (Liu et al., 12 Sep 2025).
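The update can be illustrated with a toy implementation that uses a scalar latent state and a linear mixing rule; both simplifications, along with the function and variable names, are assumptions made purely to show the structure of the update $f$.

```python
from typing import Dict, List


def update_population(
    states: Dict[int, float],
    actions: Dict[int, float],
    neighbors: Dict[int, List[int]],
    env_signal: float,
    alpha: float = 0.7,
    beta: float = 0.2,
) -> Dict[int, float]:
    """One step of the population update: each user's (here scalar) latent state
    blends its own experience, the mean feedback from neighboring users, and a
    shared environment signal."""
    new_states: Dict[int, float] = {}
    for i, s_i in states.items():
        peers = neighbors.get(i, [])
        feedback = (
            sum(states[j] + actions[j] for j in peers) / len(peers) if peers else 0.0
        )
        new_states[i] = (
            alpha * (s_i + actions[i])
            + beta * feedback
            + (1 - alpha - beta) * env_signal
        )
    return new_states


# Example: three users on a small interaction graph.
states = {0: 0.5, 1: -0.2, 2: 0.1}
actions = {0: 0.1, 1: 0.0, 2: -0.1}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(update_population(states, actions, neighbors, env_signal=0.05))
```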
6. Benchmarking Applications and Future Implications
RecoWorld serves both as a testbed for future agentic recommenders and as a safe, flexible environment to develop, benchmark, and refine algorithms prior to live deployment. The environment supports:
- Rapid prototyping and safe evaluation of novel RL-based recommendation strategies.
- Experimentation with advanced instruction-following, reflective learning, and collaborative filtering enhancements.
- Study of user-centric objectives, feedback-cycle dynamics, and new metrics for engagement, dissatisfaction, and retention.
By abstracting the complexities of live user experimentation, RecoWorld enables iterative improvement cycles and provides a reproducible, extensible platform for agentic recommender research. This marks a significant step toward recommender systems grounded in dynamic, collaborative user–agent interaction loops, capable of real-time adaptation to evolving preferences and collective behaviors (Liu et al., 12 Sep 2025).