- The paper introduces SimUSER, a novel agent-based framework that employs LLMs to simulate user behavior in evaluating recommender systems.
- It details a multi-module cognitive architecture combining persona matching, visual perception, memory, and decision-making to replicate user interactions.
- Experiments on datasets like MovieLens show SimUSER outperforms baselines in mirroring real user engagement, offering a cost-effective alternative to traditional A/B tests.
SimUSER: Simulating User Behavior with LLMs for Recommender System Evaluation
Introduction
The paper "SimUSER: Simulating User Behavior with LLMs for Recommender System Evaluation" (2504.12722) addresses a critical challenge in the evaluation of Recommender Systems (RS): the gap between offline metrics and online user behavior. Because offline evaluation is non-interactive, it often fails to measure key business outcomes such as user engagement and satisfaction, while online A/B testing is costly and labor-intensive. To bridge this gap, the work introduces SimUSER, an agent-based framework that simulates user interactions with recommender systems, using large language models (LLMs) as believable and cost-effective human proxies.
Methodology
Persona Matching
SimUSER begins by identifying consistent user personas from historical data. This involves extracting each user's preferences and profile characteristics such as age, personality, and occupation. The personas are inferred using the semantic capabilities of LLMs: the model proposes candidate personas, and the candidate that best aligns with the user's historical interactions is retained. The match is evaluated with a self-consistency score, which checks that the extracted persona correlates strongly with the user's actual behavior.
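The selection step can be illustrated with a minimal sketch. It assumes a pluggable scoring function in place of the LLM-based self-consistency score; the `Persona` fields, the `preferred_genres` attribute, and the `toy_consistency` rule are hypothetical and chosen only to make the example runnable, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence, Tuple

# (genre, rating on a 1-5 scale); a toy stand-in for a user's history.
Interaction = Tuple[str, int]


@dataclass(frozen=True)
class Persona:
    age: int
    personality: str
    occupation: str
    preferred_genres: FrozenSet[str]  # hypothetical field, used by the toy scorer only


def toy_consistency(persona: Persona, history: Sequence[Interaction]) -> float:
    """Fraction of past ratings that the persona would explain.

    Stand-in for an LLM-based score: a rating >= 4 counts as "liked", and an
    interaction is consistent when "liked" agrees with "genre is preferred".
    """
    if not history:
        return 0.0
    hits = sum(
        (rating >= 4) == (genre in persona.preferred_genres)
        for genre, rating in history
    )
    return hits / len(history)


def match_persona(
    candidates: Sequence[Persona],
    history: Sequence[Interaction],
    score_fn: Callable[[Persona, Sequence[Interaction]], float] = toy_consistency,
) -> Tuple[Persona, float]:
    """Return the highest-scoring candidate persona together with its score."""
    best = max(candidates, key=lambda p: score_fn(p, history))
    return best, score_fn(best, history)


history = [("sci-fi", 5), ("romance", 2), ("sci-fi", 4), ("horror", 1)]
candidates = [
    Persona(34, "introverted", "engineer", frozenset({"sci-fi"})),
    Persona(52, "extroverted", "teacher", frozenset({"romance", "horror"})),
]
best, score = match_persona(candidates, history)
print(best.occupation, score)
```

In a real pipeline the scoring function would presumably query an LLM; keeping it as a parameter leaves the selection logic unchanged.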
Interaction Simulation
Each persona then drives an agent built on an LLM-based cognitive architecture with modules for perception, memory, and decision-making. The perception module integrates visual elements to replicate human reasoning influenced by visual stimuli: item descriptions are enriched with captions extracted from thumbnails, which carry emotional and content-based cues relevant to RS evaluation. The memory module comprises episodic memory and knowledge-graph memory, which represent user-item interactions and external social influences.
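As a rough illustration of the two memory types, the sketch below stores episodes as plain text and relations as triples. The class names, the word-overlap retrieval (a crude stand-in for embedding similarity), and the relation labels are assumptions made for this example, not the paper's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class EpisodicMemory:
    """Stores past interactions as text and retrieves the most relevant ones."""

    def __init__(self) -> None:
        self.episodes: List[str] = []

    def add(self, episode: str) -> None:
        self.episodes.append(episode)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Rank by number of shared words; ties keep insertion order
        # because Python's sort is stable.
        query_words = set(query.lower().split())
        ranked = sorted(
            self.episodes,
            key=lambda e: len(query_words & set(e.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class KnowledgeGraphMemory:
    """Stores (head, relation, tail) triples, e.g. item attributes or who liked what."""

    def __init__(self) -> None:
        self.edges: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.edges[head].add((relation, tail))

    def neighbors(self, head: str, relation: str) -> List[str]:
        # .get avoids creating empty entries for unknown heads.
        return sorted(t for r, t in self.edges.get(head, ()) if r == relation)


episodic = EpisodicMemory()
episodic.add("rated Alien 5 stars and loved the tense sci-fi atmosphere")
episodic.add("rated Notting Hill 2 stars as too sentimental")
episodic.add("skipped a sci-fi sequel after reading poor reviews")
recalled = episodic.retrieve("new sci-fi thriller", k=2)

kg = KnowledgeGraphMemory()
kg.add("Alien", "has_genre", "sci-fi")
kg.add("Alien", "liked_by", "user_7")
kg.add("Alien", "liked_by", "user_3")
fans = kg.neighbors("Alien", "liked_by")
```

Here the episodic store answers "what did this user experience before?", while the graph answers "what is known about this item and who else engaged with it?", which is one way to read the social-influence role described above.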
The brain module processes interactions and adjusts the agent's actions based on retrieved memory evidence and visual reasoning. Decision-making includes multi-round preference elicitation, which lets the agent refine a decision by weighing contradicting and supporting evidence.
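The multi-round refinement can be pictured with a toy update rule. This summary does not specify how evidence changes a decision, so the additive rule, the `step` size, and the function name below are purely illustrative assumptions; in SimUSER the revision is performed by the LLM itself.

```python
from typing import List, Sequence, Tuple

# One round of elicitation: (supporting evidence, contradicting evidence).
EvidenceRound = Tuple[Sequence[str], Sequence[str]]


def refine_rating(
    initial_rating: float,
    rounds: List[EvidenceRound],
    step: float = 0.5,
    lo: float = 1.0,
    hi: float = 5.0,
) -> float:
    """Revise a rating estimate round by round.

    Each supporting item nudges the rating up by `step`, each contradicting
    item nudges it down, and the result is clamped to the [lo, hi] scale.
    """
    rating = initial_rating
    for supporting, contradicting in rounds:
        rating += step * (len(supporting) - len(contradicting))
        rating = max(lo, min(hi, rating))
    return rating


rounds = [
    (["liked a similar film"], []),
    (["appealing thumbnail"], ["dislikes the lead actor", "negative reviews"]),
]
final_rating = refine_rating(3.0, rounds)
```

The first round raises the estimate from 3.0 to 3.5; the second round finds more contradicting than supporting evidence and brings it back to 3.0.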
Experiments
SimUSER agents are tested on datasets such as MovieLens and AmazonBook, performing tasks such as item rating, classification, and other interactions typical of RS usage. The reported results indicate that SimUSER aligns agent behavior with real user data more closely than existing simulators such as RecAgent and Agent4Rec, both for micro-level actions (e.g., individual ratings) and macro-level preferences (e.g., overall satisfaction).
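Micro-level alignment on ratings can be quantified with standard error metrics. The sketch below assumes RMSE and MAE, which are common choices for rating prediction; this summary does not name the exact metrics the paper reports, and the rating values are made up for illustration.

```python
import math
from typing import Sequence


def mae(simulated: Sequence[float], real: Sequence[float]) -> float:
    """Mean absolute error between simulated and real ratings."""
    return sum(abs(s - r) for s, r in zip(simulated, real)) / len(real)


def rmse(simulated: Sequence[float], real: Sequence[float]) -> float:
    """Root mean squared error between simulated and real ratings."""
    return math.sqrt(sum((s - r) ** 2 for s, r in zip(simulated, real)) / len(real))


real_ratings = [4, 3, 5, 2]       # what the real user gave
simulated_ratings = [4, 2, 5, 4]  # what the agent predicted

print(mae(simulated_ratings, real_ratings))   # 0.75
print(rmse(simulated_ratings, real_ratings))  # sqrt(1.25), about 1.118
```

Lower values mean the agent's ratings track the real user's more closely; RMSE penalizes the single two-point miss more heavily than MAE does.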
User Simulators and Recommender System Evaluation
SimUSER is compared against various RS baselines, including Matrix Factorization and MultVAE. The user proxies help identify how visual cues and reviews affect user engagement metrics. The evaluations indicate that SimUSER generates interactions that align more closely with human behavior than existing models do, so it can act as a scalable proxy for real-world user evaluations.
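One way to picture how simulated users can score a recommender is to let each agent rate the items it is shown and average the results. Everything in this sketch is a toy assumption: the `tastes` table stands in for an LLM agent's judgment, and the two recommenders are trivial placeholders rather than Matrix Factorization or MultVAE.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Recommender = Callable[[str, int], List[str]]  # (user_id, k) -> top-k item ids
RateFn = Callable[[str, str], float]           # (user_id, item_id) -> rating

# Toy ground truth standing in for each simulated user's judgment.
tastes: Dict[str, Dict[str, int]] = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 2, "b": 5, "c": 3},
}


def simulated_rate(user_id: str, item_id: str) -> float:
    return tastes[user_id][item_id]


def same_for_everyone(user_id: str, k: int) -> List[str]:
    """Non-personalized placeholder: every user sees the same list."""
    return ["a", "c"][:k]


def personalized(user_id: str, k: int) -> List[str]:
    """Placeholder that ranks items by the user's own (toy) tastes."""
    return sorted(tastes[user_id], key=tastes[user_id].get, reverse=True)[:k]


def average_simulated_rating(
    recommender: Recommender, rate: RateFn, user_ids: Sequence[str], k: int = 2
) -> float:
    """Mean over users of the mean rating each user gives to its top-k list."""
    per_user = [
        mean([rate(uid, item) for item in recommender(uid, k)]) for uid in user_ids
    ]
    return mean(per_user)


users = list(tastes)
baseline_score = average_simulated_rating(same_for_everyone, simulated_rate, users)
personalized_score = average_simulated_rating(personalized, simulated_rate, users)
```

With these toy values the shared list averages 2.75 and the personalized list 4.25, which shows how the same pool of agents can rank competing recommenders without a live A/B test.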
Practical Relevance
SimUSER shows potential as a more scalable and privacy-preserving alternative to conventional A/B testing, one that captures the subtle effects of UX design decisions such as thumbnails and user reviews. This methodology could significantly reduce RS evaluation costs while replicating user behavior with high fidelity.
Conclusion
By leveraging LLMs, SimUSER enables realistic user proxies for recommender systems, offering a new direction for RS evaluation. The work argues that such simulation can narrow the gap left by offline accuracy metrics while enabling nuanced, interactive evaluation. Further development is suggested in areas such as cold-start scenarios and more dynamic agent interactions. The framework is a promising avenue for RS designers seeking automated, ethical, and effective evaluation protocols.