Overview of "Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs"
The paper "Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs" presents an evaluation benchmark called PrefEval, which is designed to assess the ability of LLMs to understand and follow user preferences during extended conversations. The research underscores a significant aspect of user interactions with LLM-based systems, namely, the personalization of responses based on inferred user preferences. As LLMs become increasingly prevalent in applications such as chatbots, evaluating their capacity to adapt to user preferences becomes crucial.
PrefEval Benchmark
PrefEval comprises a dataset of 3,000 user preference and query pairs spanning 20 diverse topics, covering both explicit and implicit forms of preference expression. The benchmark evaluates LLMs on two task types: generation, in which models produce free-form responses to user queries, and classification, posed as multiple-choice questions. It also assesses models across different session lengths, stress-testing their ability to maintain context over conversations reaching up to 100k tokens.
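To make the setup concrete, here is a minimal sketch of what one PrefEval-style example and its multi-turn evaluation context might look like. The field names, example content, and `build_conversation` helper are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# A sketch of one benchmark example; field names are hypothetical.
@dataclass
class PrefEvalExample:
    topic: str            # one of the 20 topics, e.g. "travel"
    preference: str       # the user's explicitly stated or implied preference
    preference_form: str  # "explicit" or "implicit"
    query: str            # a later query whose answer should respect the preference

example = PrefEvalExample(
    topic="travel",
    preference="I avoid flying because of my carbon footprint.",
    preference_form="explicit",
    query="How should I get from Boston to New York?",
)

def build_conversation(ex: PrefEvalExample, filler_turns: list[str]) -> list[dict]:
    """Interleave the preference, distractor turns, and the final query,
    mimicking the long multi-session setting (up to ~100k tokens)."""
    messages = [{"role": "user", "content": ex.preference}]
    for turn in filler_turns:
        messages.append({"role": "user", "content": turn})
        messages.append({"role": "assistant", "content": "(assistant reply)"})
    messages.append({"role": "user", "content": ex.query})
    return messages
```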
Key Findings
A pivotal finding of the paper is that state-of-the-art LLMs struggle to proactively follow user preferences in multi-turn dialogues. In zero-shot settings, models adhere to stated preferences less than 10% of the time by the 10th conversational turn (roughly 3k tokens). Advanced techniques such as prompting and retrieval-augmented generation yield some improvement, but performance still degrades as the context lengthens.
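The retrieval-augmented mitigation can be sketched as follows: before answering, fetch the past turns most relevant to the current query and prepend them as a reminder. The naive word-overlap scorer below stands in for a real embedding-based retriever; it is an illustrative assumption, not the paper's implementation.

```python
# Naive Jaccard word overlap as a stand-in retrieval score.
def score(query: str, turn: str) -> float:
    q, t = set(query.lower().split()), set(turn.lower().split())
    return len(q & t) / (len(q | t) or 1)

def retrieve_reminder(history: list[str], query: str, k: int = 2) -> str:
    """Select the k history turns most similar to the query and format
    them as a reminder to prepend to the model's context."""
    top = sorted(history, key=lambda turn: score(query, turn), reverse=True)[:k]
    return "Relevant earlier turns:\n" + "\n".join(f"- {t}" for t in top)

history = [
    "I avoid flying because of my carbon footprint.",
    "What's the weather like today?",
    "Tell me a joke about trains.",
]
print(retrieve_reminder(history, "How should I travel from Boston to New York?"))
```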
Another significant observation concerns personalization through fine-tuning. The paper reports notable improvements when LLMs are fine-tuned on PrefEval, indicating that such targeted training can strengthen a model's capacity to memorize and adhere to user preferences.
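A fine-tuning pipeline of this kind typically converts benchmark examples into chat-format supervised training records. The sketch below assumes a generic JSONL chat format and a reference answer that respects the preference; PrefEval's actual fine-tuning setup may format its data differently.

```python
import json

# Convert one preference/query pair into a hypothetical SFT record.
def to_sft_record(preference: str, query: str, reference_answer: str) -> str:
    return json.dumps({
        "messages": [
            {"role": "user", "content": preference},
            {"role": "assistant", "content": "Understood, I'll keep that in mind."},
            {"role": "user", "content": query},
            {"role": "assistant", "content": reference_answer},
        ]
    })

record = to_sft_record(
    "I avoid flying because of my carbon footprint.",
    "How should I travel from Boston to New York?",
    "Since you prefer not to fly, consider the Amtrak Acela or an intercity bus.",
)
print(record)  # one line of a JSONL training file
```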
Error Analysis
The authors conducted a detailed error analysis, categorizing failures into preference-unaware violations, hallucinations of preferences, inconsistent responses, and unhelpful responses. This analysis revealed that even when LLMs correctly inferred a preference, they frequently failed to apply it consistently over longer dialogues. Moreover, direct prompting and retrieval-augmented techniques sometimes backfired, leading models to hallucinate preferences the user never expressed or to produce unhelpful responses.
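The four categories can be encoded as a small taxonomy for automated labeling, for instance with an LLM-as-judge prompt. The category descriptions and judge prompt below are paraphrased assumptions, not the paper's exact evaluation protocol.

```python
from enum import Enum

# The four error categories from the paper's analysis, paraphrased.
class PrefError(Enum):
    PREFERENCE_UNAWARE = "ignores the stated preference outright"
    HALLUCINATION = "asserts a preference the user never expressed"
    INCONSISTENT = "acknowledges the preference but contradicts it"
    UNHELPFUL = "respects the preference but fails to answer the query"

JUDGE_PROMPT = """Given the user's preference, their query, and the assistant's
response, label the response with exactly one of:
{labels}
or NONE if the response both respects the preference and answers the query."""

def build_judge_prompt() -> str:
    labels = "\n".join(f"- {e.name}: {e.value}" for e in PrefError)
    return JUDGE_PROMPT.format(labels=labels)

print(build_judge_prompt())
```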
Implications and Future Research Directions
The findings suggest that while current LLMs show strong language comprehension and generation, they remain limited in personalization, particularly in long, context-rich interactions. The PrefEval benchmark provides a robust framework for future research on LLM personalization by making these limitations measurable.
Future research could explore more sophisticated methods for long-context retrieval and preference inference, aiming to further refine personalization capabilities. Additionally, addressing the "lost in the middle" phenomenon, in which LLMs struggle to retrieve information from mid-context positions, is crucial for advancing LLM applications in real-world conversational agents.
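One way to study that phenomenon in this setting is to vary where the preference appears in a long context and measure adherence at each depth. In the sketch below, `query_model` and `respects_preference` are hypothetical stand-ins for a real model call and an LLM-judge check.

```python
# Probe adherence as a function of the preference's position in the context.
def probe_positions(preference: str, query: str, filler: list[str],
                    query_model, respects_preference) -> dict[float, bool]:
    results = {}
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):  # fraction of context preceding the preference
        cut = int(len(filler) * depth)
        context = filler[:cut] + [preference] + filler[cut:] + [query]
        response = query_model(context)
        results[depth] = respects_preference(response, preference)
    return results

# Toy stand-ins to exercise the probe; a real run would call an actual model.
fake_model = lambda context: " ".join(context[-3:])      # "remembers" only recent turns
fake_judge = lambda response, pref: pref in response
print(probe_positions("no flying", "travel plans?",
                      [f"turn {i}" for i in range(8)], fake_model, fake_judge))
```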
As LLMs are continually updated and improved, benchmarks like PrefEval will be instrumental in verifying their progress and directing advancements toward more human-like, context-aware LLM interactions. The paper opens avenues for refined models capable of seamless personalization, thus contributing to areas such as customer service automation, personalized content delivery, and interactive AI companions.