
Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation (2403.09738v4)

Published 13 Mar 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. LLMs show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which LLMs can accurately emulate human behavior in conversational recommendation. The protocol comprises five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate that these tasks effectively reveal deviations of LLMs from human behavior, and we offer insights into how to reduce these deviations with model selection and prompting strategies.

Evaluating LLMs as Generative User Simulators for Conversational Recommendation

Introduction to the Evaluation Protocol

Synthetic users are increasingly used in the evaluation of conversational recommender systems (CRSs), since they serve as cost-effective proxies for real user interactions. The growing ability of LLMs to simulate human-like behavior raises the question of whether they can represent a diverse population of users in conversational recommendation settings. This paper introduces an evaluation protocol that assesses the extent to which LLMs emulate human behavior across five tasks, each designed to capture a critical aspect of user simulation: item selection, binary preference expression, open-ended preference articulation, recommendation requests, and feedback provision.

Task-Specific Evaluation

Each task in the protocol evaluates a property essential for a synthetic user in a CRS. The five tasks are:

  • ItemsTalk, which assesses a simulator's choice of items to discuss.
  • BinPref, focusing on the expression of binary item preferences.
  • OpenPref, evaluating the articulation of open-ended preferences.
  • RecRequest, testing the ability to request recommendations.
  • Feedback, examining the provision of coherent feedback on recommendations received.

Running these tasks against a dataset of real user interactions allows direct comparison between the behavior of LLM-generated synthetic users and that of actual human participants.
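
To make the setup concrete, here is a minimal sketch (not the authors' released code) of how an ItemsTalk-style comparison could be scored: collect the items a simulator chooses to discuss, then compare their distribution against a human reference corpus with a divergence measure. The item lists below are hypothetical placeholders.

```python
# Hedged sketch of ItemsTalk-style scoring: compare the distribution of
# items a simulator mentions against a human reference distribution.
# The mention lists are hypothetical placeholders, not real data.
from collections import Counter
import math

def distribution(mentions):
    """Normalize raw mention counts into a probability distribution."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (in bits) between two item distributions."""
    support = set(p) | set(q)
    m = {i: 0.5 * (p.get(i, 0.0) + q.get(i, 0.0)) for i in support}
    def kl(a, b):
        return sum(a[i] * math.log2(a[i] / b[i]) for i in a if a[i] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

human_mentions = ["Inception", "Parasite", "Amelie", "Inception", "Her"]
simulator_mentions = ["Inception", "Inception", "Inception", "Titanic", "Avatar"]

div = jensen_shannon(distribution(human_mentions), distribution(simulator_mentions))
print(f"JS divergence (lower = closer to human behavior): {div:.3f}")
```

A popularity-skewed simulator, of the kind discussed in the findings below, would show a high divergence from the human distribution under this kind of measure.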

Insights from Baseline Simulators

Application of the evaluation tasks to baseline simulators reveals areas where LLMs deviate from human-like behavior:

  • LLM simulators mention popular items far more frequently than human data would predict, indicating reduced item diversity.
  • Simulators often align poorly with human preferences, particularly in tasks requiring binary responses.
  • Recommendation requests generated by LLMs lack personalization and granularity, with synthetic users producing more generic requests than humans do.
  • Feedback on recommendations is occasionally incoherent, suggesting the simulator has misunderstood the nuances of the original request.
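
As one concrete illustration of the second point, preference alignment in a BinPref-style task can be reduced to a simple agreement rate between the simulator's binary answers and held-out human labels for the same items. The setup below is an assumed simplification for illustration, not the paper's evaluation code.

```python
# Hedged sketch of BinPref-style alignment: fraction of items where the
# simulator's like/dislike answer matches the human's recorded label.
# Item names and labels are hypothetical.
human_labels = {"Inception": 1, "Titanic": 0, "Her": 1, "Avatar": 0}
simulator_answers = {"Inception": 1, "Titanic": 1, "Her": 1, "Avatar": 0}

agree = sum(simulator_answers[item] == label for item, label in human_labels.items())
print(f"binary-preference agreement with human labels: {agree / len(human_labels):.0%}")
```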

Enhancing Simulator Realism

The findings underscore that while LLMs are a promising avenue for user simulation in CRSs, there is substantial room for improvement in aligning their output with human behavior. The discrepancies identified by the proposed tasks offer a pathway toward refining simulator design: the authors report promising results when incorporating elements such as a pickiness persona and interaction history into prompt design. Such strategies help narrow the gap between simulator outputs and the diverse, personalized responses characteristic of human users.
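
A minimal sketch of what such a prompting strategy might look like follows; the template wording and the build_prompt helper are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative prompt construction: inject a "pickiness" persona and the
# user's interaction history into the simulator prompt. The wording is an
# assumption for illustration, not the paper's exact template.
def build_prompt(persona: str, history: list[str], request: str) -> str:
    history_text = "\n".join(f"- {title}" for title in history)
    return (
        f"You are simulating a {persona} movie watcher.\n"
        f"Movies you have watched recently:\n{history_text}\n\n"
        f"Task: {request}\n"
        "Respond in one or two sentences, in the first person."
    )

prompt = build_prompt(
    persona="very picky",
    history=["Inception", "Parasite", "Her"],
    request="Ask the recommender for a movie you would enjoy tonight.",
)
print(prompt)
```

Conditioning on interaction history gives the simulator concrete material for personalized, granular requests, addressing the genericness noted in the findings above.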

Concluding Reflections

This work takes a significant step toward a framework for the rigorous evaluation of LLM-based user simulators within conversational recommendation settings. By highlighting current limitations and suggesting avenues for refinement, it lays the groundwork for future research aimed at enhancing the realism of synthetic users. The pursuit of more human-like synthetic user behavior not only improves the development of CRSs but also contributes to the broader discourse on the capacities and limitations of LLMs in approximating complex human interactions.

Authors (4)
  1. Se-eun Yoon (10 papers)
  2. Zhankui He (27 papers)
  3. Jessica Maria Echterhoff (3 papers)
  4. Julian McAuley (238 papers)
Citations (7)