
PrefEval Benchmark for LLM Personalization

Updated 5 November 2025
  • PrefEval is a benchmark designed to evaluate LLMs' ability to infer, retain, and apply specific user preferences in multi-turn, distractor-heavy conversations.
  • It utilizes both generation and classification tasks, scoring models on error types like preference-unaware violations and hallucinations under realistic conditions.
  • Empirical results show severe limitations in LLMs’ long-context preference recall, while methods like SFT and RAG offer moderate improvements.

PrefEval is a benchmark developed to rigorously evaluate the capacity of LLMs to infer, retain, and apply user preferences in long-context, multi-session conversational settings. It is designed to go beyond stylized or role-based personalization, focusing instead on preference adherence relevant to real-world, open-domain, and distractor-rich interactions. PrefEval incorporates both explicit and implicit forms of preference disclosure and enables evaluation via multiple tasks, notably generation and classification, with a focus on scalability, interpretability, and diagnostic error analysis.

1. Design, Structure, and Data Composition

PrefEval consists of 3,000 manually curated preference-query pairs spanning 20 real-world recommendation and advice domains, including travel, entertainment, food, fitness, education, shopping, and professional advice. Each unique user preference is represented in three forms:

  • Explicit Preference: Direct statement (e.g., "I dislike spicy food.")
  • Implicit Choice-Based Dialogue: Preference inferred from option selection/rejection across two turns.
  • Implicit Persona-Driven Dialogue: Preference embedded in extended (4–8 turn) persona-centric exchanges.

Crucially, between the initial preference disclosure and the subsequent query, each conversation injects unrelated "distractor" dialogue turns—up to 100,000 tokens sourced from actual LMSYS-Chat-1M logs—to simulate chat histories in which preferences can be deeply embedded or “lost in the middle.”
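As a minimal sketch (not the benchmark's actual construction code), an evaluation conversation can be assembled by concatenating the preference disclosure, the injected distractor turns, and the final query. The role/content message schema below is the common chat convention and is assumed here, not taken from the dataset.

```python
# Minimal sketch of how a PrefEval-style evaluation conversation can be assembled:
# a preference disclosure, distractor turns, then the query.
# Field names ("role", "content") are illustrative, not the benchmark's exact schema.

def build_conversation(preference_turns, distractor_turns, query):
    """Concatenate preference disclosure, distractor turns, and the final query."""
    conversation = []
    conversation.extend(preference_turns)   # explicit or implicit preference disclosure
    conversation.extend(distractor_turns)   # unrelated turns, e.g. sampled from LMSYS-Chat-1M
    conversation.append({"role": "user", "content": query})
    return conversation

# Example with an explicit preference and two distractor turns.
preference = [
    {"role": "user", "content": "I dislike spicy food."},
    {"role": "assistant", "content": "Noted, I'll keep that in mind."},
]
distractors = [
    {"role": "user", "content": "Can you explain list comprehensions in Python?"},
    {"role": "assistant", "content": "Sure, a list comprehension builds a list from an iterable..."},
]
query = "Can you recommend a restaurant for dinner tonight?"

conversation = build_conversation(preference, distractors, query)
```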

Instances in the multiple-choice classification subset adopt a question–preference–answers triple structure (an illustrative instance is sketched after the list):

  • Question: The user’s information or assistance request.
  • User Preference: Explicitly stated or implicit within prior conversation.
  • Candidate Answers: Four options, of which exactly one aligns with the true user preference. Each gold answer includes a human-written explanation for reference.
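For concreteness, a hypothetical instance in this triple structure might look as follows; the field names, answer texts, and explanation are invented for illustration and do not reproduce actual dataset content.

```python
# Illustrative (hypothetical) classification instance mirroring the
# question / preference / candidate-answers triple described above.
mcq_instance = {
    "question": "Can you recommend a restaurant for dinner tonight?",
    "preference": "I dislike spicy food.",
    "candidates": [
        "Try the Sichuan hotpot place downtown; the spicier, the better.",       # violates preference
        "A quiet Italian trattoria with mild, classic dishes would suit you.",   # aligned (gold)
        "The new Thai spot is famous for its extra-hot curries.",                # violates preference
        "Any place is fine; spice level doesn't really matter.",                 # ignores preference
    ],
    "gold_index": 1,
    "explanation": "The Italian option avoids spicy food, matching the stated preference.",
}
```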

2. Evaluation Protocols and Tasks

PrefEval supports two principal evaluation types:

A. Generation Task:

Given the full conversational history—including explicit/implicit user preferences—an LLM must generate a context-appropriate answer. Evaluation employs an LLM-as-judge (e.g., Claude 3 Sonnet) performing four binary checks for:

  • Preference-unaware violation
  • Preference hallucination violation
  • Inconsistency violation
  • Unhelpful response

A response counts as correct only if none of these error types is present (a minimal judging sketch follows).
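The sketch below illustrates this judging protocol under stated assumptions: `call_judge` is a hypothetical callable standing in for an API call to the judge model (e.g., Claude 3 Sonnet), and the prompt wording is illustrative, not the benchmark's actual judging template.

```python
# Minimal sketch of the LLM-as-judge protocol: four binary checks per response,
# with a response counted correct only if no violation is flagged.

ERROR_TYPES = [
    "preference-unaware violation",
    "preference hallucination violation",
    "inconsistency violation",
    "unhelpful response",
]

def judge_response(conversation, preference, response, call_judge):
    """Return a dict mapping each error type to a boolean verdict."""
    verdicts = {}
    for error_type in ERROR_TYPES:
        prompt = (
            f"Conversation history:\n{conversation}\n\n"
            f"Stated user preference: {preference}\n"
            f"Assistant response: {response}\n\n"
            f"Does the response exhibit a {error_type}? Answer Yes or No."
        )
        verdicts[error_type] = call_judge(prompt).strip().lower().startswith("yes")
    return verdicts

def is_correct(verdicts):
    """A response is correct only when none of the four error types is flagged."""
    return not any(verdicts.values())
```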

B. Classification Task (Multiple-Choice):

Supplied with the context, a query, and four candidate completions (one aligned, three distractors), the system must select the answer consistent with the user's preference.

A strong positive correlation (Spearman/Pearson ρ = 0.73) is observed between generation accuracy and classification MCQ accuracy, validating the classification task as an effective and scalable proxy for automated assessment.

3. Experimental Setup and Baseline Methods

PrefEval assessments involve open-source and commercial LLMs (e.g., Claude 3 Haiku/Sonnet, GPT-4o, Gemini-1.5-Pro, LLaMA 3 8B/70B Instruct, Mistral 7B/8x7B).

Evaluation is conducted at variable context gaps between preference and query (from 3k up to 100k tokens, or 3 to 300 turns) to challenge models' long-context memory and retrieval. Five major prompting and retrieval methods are benchmarked:

  1. Zero-shot: Direct response, no attention called to preference.
  2. Reminder: An inserted prompt reminding the model to account for user preferences.
  3. Self-Critic: The model generates, self-reviews, and revises its answer for preference alignment.
  4. Few-Shot Chain-of-Thought: In-context demonstrations of step-wise preference following are prepended before the query.
  5. Retrieval-Augmented Generation (RAG): The most relevant previous user utterances are retrieved via SimCSE embeddings and provided as additional context (a minimal retrieval sketch follows this list).
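The retrieval step can be sketched as below, assuming a generic `embed(texts)` callable (e.g., a SimCSE encoder) that returns one vector per input text; chunking, prompt assembly, and the exact encoder API are assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal RAG retrieval sketch: embed past user turns, embed the current query,
# and return the top-k most cosine-similar turns to prepend as extra context.

def retrieve_relevant_turns(past_user_turns, query, embed, k=3):
    """Return the k past user turns most similar to the query."""
    turn_vecs = np.asarray(embed(past_user_turns))          # shape: (num_turns, dim)
    query_vec = np.asarray(embed([query]))[0]                # shape: (dim,)
    turn_vecs = turn_vecs / np.linalg.norm(turn_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = turn_vecs @ query_vec                           # cosine similarities
    top_idx = np.argsort(scores)[::-1][:k]
    return [past_user_turns[i] for i in top_idx]
```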

4. Empirical Findings and Error Analysis

Extensive experimentation reveals that LLMs across architectures display significant deficiencies in proactive preference recall and adherence under realistic, distractor-heavy conditions:

  • In zero-shot and even reminder settings, accuracy drops below 10% after 10 conversational turns (~3k tokens) in nearly all evaluated models, with minor exceptions (GPT-o1-preview: 50% at 10 turns).
  • At 300 turns, even Reminder prompting yields only 2–23% accuracy across models.
  • RAG helps when models have limited internal retrieval capability, but performance still deteriorates as context grows, even when the relevant preference is present in the prompt.
  • Error analysis attributes most zero-shot failures to "preference-unaware violations." Advanced prompts reduce such errors (by making models more attentive to instructions) but increase "preference hallucinations" (i.e., fabricated preferences). Nontrivial instances of "unhelpful" responses—wherein the LLM declines to answer—are also observed.
  • Implicit preference forms (choice-based and persona-driven) induce larger performance drops than explicit ones, with further degradation when preferences are positioned "mid-context," paralleling the "lost in the middle" phenomenon known in factual retrieval.

A notable empirical observation is that user preferences may become more salient when multiple or even conflicting preferences appear in the history, possibly because redundancy and repetition reinforce them.

Sample error-type assessment table:

Error Type                Violation  Acknowledge  Hallucinate  Helpful
Preference-unaware        Yes        No           N/A          Yes
Preference hallucination  Yes        Yes          Yes          Yes
Inconsistency             Yes        Yes          No           Yes
Unhelpful response        No         Yes/No       N/A          No

5. Impact of Supervised Fine-Tuning and Metrics

Supervised fine-tuning (SFT) of LLMs on PrefEval data produces substantial gains in preference-following, particularly for long-context and implicit preference tasks:

  • Mistral 7B fine-tuned on 80% of topics (and evaluated on the remaining 20%) surpasses Reminder and RAG techniques and generalizes to longer contexts than seen in training.
  • Attention analysis after fine-tuning reveals an increase of up to 4.97% in attention weight allocated to preference-disclosing regions (a minimal measurement sketch follows this list).
  • SFT improves, but does not eliminate, limitations for challenging implicit preference types.
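This kind of attention measurement can be approximated with a short sketch, assuming a Hugging Face causal LM, a fast tokenizer (needed for offset mappings), and a character span marking where the preference was disclosed. Which layers, heads, and query positions the paper aggregates over is not reproduced here; the averaging below is one simple choice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # model/tokenizer supplied by caller

# Sketch: fraction of the final token's attention mass that falls on the
# token span where the preference was disclosed.

def preference_attention_fraction(model, tokenizer, text, pref_char_span):
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]                     # (seq_len, 2) char offsets
    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # Average attention over layers and heads, taken from the last position.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # (seq_len,)
    start, end = pref_char_span
    in_span = [(s < end and e > start) for s, e in offsets.tolist()]
    mask = torch.tensor(in_span, dtype=attn.dtype)
    return (attn * mask).sum().item() / attn.sum().item()
```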

Primary evaluation metrics are:

  • Preference-following accuracy in the generation task, formally

$$\text{accuracy} = \frac{\text{Number of responses with no error type}}{\text{Total responses}}$$

  • Classification accuracy on the multiple-choice task (validated against human judgment).
  • The strong Spearman/Pearson correlation between the two metrics, which holds across all tested methods (see the sketch after this list).
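A minimal sketch of how these metrics and their correlation can be computed; the judge flags and per-method accuracy values below are invented placeholders, not reported results.

```python
from scipy.stats import pearsonr, spearmanr

def accuracy(flags):
    """Fraction of responses judged correct (no error type flagged)."""
    return sum(flags) / len(flags)

# Hypothetical per-response judge verdicts for one method.
judge_flags = [True, False, False, True, False]
print("generation accuracy:", accuracy(judge_flags))          # 0.4

# Hypothetical per-method accuracies on the two tasks (one entry per method).
generation_acc = [0.08, 0.21, 0.35, 0.52, 0.61]
classification_acc = [0.42, 0.48, 0.57, 0.70, 0.78]
print("Pearson r:", pearsonr(generation_acc, classification_acc)[0])
print("Spearman rho:", spearmanr(generation_acc, classification_acc)[0])
```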

6. Availability, Usage, and Research Implications

PrefEval is available at https://prefeval.github.io/ and https://github.com/amazon-science/PrefEval, together with evaluation scripts and prompting templates as used in the main studies.

The benchmark sets a new diagnostic standard for personalization in LLMs:

  • Enables development and assessment of new fine-tuning, RAG, and context management techniques tailored for personalized assistants.
  • Supports detailed error-type breakdowns by topic, preference form, or conversation configuration.
  • Its classification proxy task allows scalable, repeatable automated evaluation—crucial for rapid iteration or large-scale ablation studies.

Key bottlenecks highlighted include robust preference inference, effective long-context retrieval, and context-aware preference application. While SFT offers measurable improvements, the persistence of breakdowns for implicit forms and at large context sizes remains a critical area for advancement.

7. Relation to Reference-Free Evaluation: Use in PREF

The PrefEval benchmark has been adopted by the PREF reference-free evaluation framework (Fu et al., 8 Aug 2025) to provide both assessment data and a task design for evaluating personalized NLG. In this context:

  • PREF is tested on the implicit multiple-choice subset of PrefEval (200 questions, 800 answers, 80/20 train/test split).
  • The triple-structured data (question, preference, answers) aligns with PREF’s three-step rubric scoring pipeline: universal guideline construction, preference-conditioned rubric personalization, and LLM-judge scoring (a schematic sketch follows this list).
  • PREF achieves absolute accuracy improvements of 3–8 points over Reminder and more than 55 points over Zero-shot, and exhibits strong calibration (MSE) and ranking performance (nDCG) even with relatively small backbone models.
  • PREF’s factor-weighted personalized rubrics, validated against the human-justified explanations in PrefEval, show moderate to strong correspondence in reason alignment (Pearson r up to 0.62).
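For orientation only, the three-step pipeline can be caricatured as three chained LLM calls. The `llm` callable, prompt wording, rubric format, factor weighting, and scoring scale below are placeholders and do not reproduce PREF's actual implementation.

```python
# Highly schematic sketch of a three-step rubric-scoring pipeline
# (guidelines -> preference-conditioned rubric -> judge score).
# `llm` is a hypothetical text-completion callable.

def pref_style_score(question, preference, answer, llm):
    # Step 1: derive universal quality guidelines for this question.
    guidelines = llm(f"List general quality criteria for answering:\n{question}")
    # Step 2: personalize and weight the rubric using the user preference.
    rubric = llm(
        f"Adapt and weight these criteria for a user with the preference "
        f"'{preference}':\n{guidelines}"
    )
    # Step 3: score the candidate answer against the personalized rubric.
    verdict = llm(
        f"Using this rubric:\n{rubric}\n\n"
        f"Score the following answer from 1 to 5 and briefly justify the score:\n{answer}"
    )
    return verdict
```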

A plausible implication is that the PrefEval benchmark not only exposes the fundamental weaknesses and error modes of current LLMs in personalized, realistic dialogue but also grounds the design and assessment of advanced, interpretable evaluation methodologies for personalized generation and ranking.
