FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users (2502.19312v1)

Published 26 Feb 2025 in cs.LG, cs.AI, cs.CL, cs.HC, and stat.ML

Abstract: Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.

Summary

  • The paper presents FSPO (Few-Shot Preference Optimization), framing LLM personalization as a meta-learning problem solved using synthetic data and few-shot learning for adaptation to real users.
  • To address limited real data, the method proposes generating diverse and structured synthetic preference datasets designed to generalize effectively to real human users.
  • Experimental results, including a 72% win rate in a controlled human study, validate that FSPO effectively personalizes LLM responses for real users across diverse tasks.

The paper introduces Few-Shot Preference Optimization (FSPO), a framework for personalizing responses generated by LLMs that models a distribution of reward functions to capture diverse human preferences. FSPO leverages in-context learning to adapt to new users and subpopulations. The paper frames preference modeling as a meta-learning problem, in which the LLM learns to adapt quickly to individual users from a few labeled preferences. To address the scarcity of real-world preference data, the authors propose generating synthetic preference datasets, emphasizing high diversity and coherent structure to ensure successful transfer to real users.
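
As a concrete illustration of the few-shot conditioning and the two-step user-description chain-of-thought described below, here is a minimal Python sketch. The prompt wording, the `Preference` container, and the `llm.generate` interface are hypothetical assumptions made for illustration, not the authors' code.

```python
# Minimal sketch (assumptions, not the paper's implementation): pack a user's few
# labeled preferences into an in-context prompt, then answer a new question via the
# two-step user-description chain-of-thought.
from dataclasses import dataclass

@dataclass
class Preference:
    prompt: str    # question shown to the user
    chosen: str    # response the user preferred
    rejected: str  # response the user did not prefer

def build_fewshot_context(prefs: list[Preference]) -> str:
    """Serialize a user's few-shot preferences into a single context string."""
    blocks = []
    for i, p in enumerate(prefs, 1):
        blocks.append(
            f"Example {i}\nPrompt: {p.prompt}\n"
            f"Preferred response: {p.chosen}\n"
            f"Rejected response: {p.rejected}"
        )
    return "\n\n".join(blocks)

def personalized_answer(llm, prefs: list[Preference], question: str) -> str:
    """Two-step user-description CoT: infer a user description, then respond."""
    context = build_fewshot_context(prefs)
    # Step 1: infer a natural-language description of the user from their preferences.
    user_desc = llm.generate(
        f"{context}\n\nBased on these preferences, describe this user's background "
        f"and preferred style."
    )
    # Step 2: answer the new question conditioned on the preferences and the description.
    return llm.generate(
        f"{context}\n\nUser description: {user_desc}\n\n"
        f"Question: {question}\nAnswer in a way this user would prefer."
    )
```

The key design point is that all personalization signal enters through the serialized few-shot preferences, so adapting to a new user requires no gradient updates at deployment time.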

Key aspects of the proposed method and evaluation include:

  • FSPO Framework: The framework models diverse subpopulations in preference datasets to elicit personalization in LLMs for open-ended question answering, leveraging in-context learning to adapt to new users. The model is fine-tuned with a meta-learning objective built on preference-learning losses such as IPO (a minimal loss sketch follows this list). A user-description chain-of-thought (CoT) allows the model to spend additional inference compute on reward modeling and on instruction following for response generation.
  • Synthetic Preference Datasets: The paper highlights the difficulty of collecting diverse preference datasets from humans and proposes a method to instantiate this dataset synthetically, emphasizing the generation of datasets that are both diverse and structured. This approach includes using domain randomization to simulate a diverse set of synthetic preferences. The goal is to create data that generalizes to real human users.
  • Personalization as Meta-Learning: The paper recasts personalization as a meta-learning problem by annotating each labeled preference with a scorer id ${S}^{(i)}$ identifying the user who provided it: $\mathcal{D}_{\text{pref}} = \{({x}^{(i)}, {y}_w^{(i)}, {y}_l^{(i)}, {S}^{(i)})\}$. Each user is treated as a task instance, and the objective is to learn an effective reward function for each user from their labeled preferences.
  • User Description Chain-of-Thought (CoT): The paper introduces a two-step prediction process: first, a user description is generated conditioned on the user's few-shot preferences; then, conditioned on the prompt, the few-shot preferences, and the generated user description, a response is generated.
  • Bradley-Terry Model: The paper leverages the Bradley-Terry (BT) model, which expresses the probability of preferring response ${y}_1$ over ${y}_2$, given a prompt ${x}$, as:

    $$p^*({y}_1 \succ {y}_2 \mid {x}) = \frac{e^{r^*({x}, {y}_1)}}{e^{r^*({x}, {y}_1)} + e^{r^*({x}, {y}_2)}}$$

    where:

    • $p^*({y}_1 \succ {y}_2 \mid {x})$ is the probability that ${y}_1$ is preferred over ${y}_2$ given ${x}$.
    • $r^*({x}, {y}_1)$ and $r^*({x}, {y}_2)$ are the values of the reward function for responses ${y}_1$ and ${y}_2$ given prompt ${x}$.
  • Domains for Personalization: The paper constructs a benchmark across three domains:
    • Reviews: Generating reviews of movies, TV shows, anime, and books consistent with a user's writing style.
    • Explain Like I'm X (ELIX): Generating responses consistent with a user's education level.
    • Roleplay: Generating responses consistent with a user's description for general question answering.
  • Domain Randomization Techniques: The paper uses view-conditioning and iterative persona generation. View-conditioning decomposes a given question into multiple viewpoints, allowing for diverse response generation. Iterative persona generation removes persona underspecification: a persona is iteratively refined whenever it is too underspecified to support a preference prediction, yielding better-structured data.
  • Experimental Results:
    • FSPO achieves an 87% Alpaca Eval win rate on average in generating responses personalized to synthetic users.
    • A controlled human study shows a 72% win rate with real human users in open-ended question answering.
    • In the ELIX task, FSPO consistently outperforms baselines at both easy and hard difficulty levels.
    • In the Review task, FSPO achieves better performance on held-out questions.
    • In the Roleplay task, scaling to 1,500 synthetic users, FSPO shows a win rate of 82.6% on both held-out users and questions.
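
To make the IPO objective and the Bradley-Terry model referenced above concrete, the following is a minimal PyTorch-style sketch assuming the standard DPO/IPO implicit-reward parameterization; the function names, the `beta` value, and the log-probability inputs are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): Bradley-Terry preference
# probability and an IPO-style loss over few-shot-conditioned log-probabilities.
import torch

def bradley_terry_prob(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """p*(y1 > y2 | x) = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)."""
    return torch.sigmoid(r_chosen - r_rejected)

def ipo_fewshot_loss(
    logp_chosen: torch.Tensor,      # log pi_theta(y_w | user's few-shot prefs, x)
    logp_rejected: torch.Tensor,    # log pi_theta(y_l | user's few-shot prefs, x)
    ref_logp_chosen: torch.Tensor,  # same quantities under the frozen reference model
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,              # illustrative regularization strength
) -> torch.Tensor:
    """IPO-style loss: regress the policy/reference log-ratio gap toward 1/(2*beta)."""
    # Implicit reward gap between the chosen and rejected responses.
    h = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return ((h - 1.0 / (2.0 * beta)) ** 2).mean()
```

Because every batch element is conditioned on a single user's few-shot preferences, minimizing this loss over many users matches the meta-learning framing: the model learns a per-user implicit reward that it can instantiate entirely in context.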

The paper concludes by discussing the limitations and potential risks of personalization, including ethical and fairness considerations and the risk of reinforcing user biases. The authors advocate for future work to explore mechanisms that balance personalization with ethical safeguards.