Diverse Preference Optimization (2501.18101v4)

Published 30 Jan 2025 in cs.CL

Abstract: Post-training of LLMs, either through reinforcement learning, preference optimization or supervised finetuning, tends to sharpen the output probability distribution and reduce the diversity of generated responses. This is particularly a problem for creative generative tasks where varied responses are desired. In this work we introduce Diverse Preference Optimization (DivPO), an optimization method which learns to generate much more diverse responses than standard pipelines, while maintaining the quality of the generations. In DivPO, preference pairs are selected by first considering a pool of responses, and a measure of diversity among them, and selecting chosen examples as being more rare but high quality, while rejected examples are more common, but low quality. DivPO results in generating 45.6% more diverse persona attributes, and a 74.6% increase in story diversity, while maintaining similar win rates as standard baselines. On general instruction following, DivPO results in a 46.2% increase in diversity, and a 2.4% winrate improvement compared to DPO.

Summary

  • The paper introduces DivPO, an online preference optimization method that integrates a diversity-aware selection mechanism to balance output quality and variety.
  • It employs evaluation metrics like cosine similarity and BLEU to measure response diversity within candidate pools generated by the model.
  • DivPO achieves significant diversity gains (up to +74.6%) in creative tasks while maintaining quality levels comparable to standard DPO methods.

The paper "Diverse Preference Optimization" (2501.18101) addresses the common issue of reduced output diversity resulting from standard post-training techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). These methods often sharpen the model's output distribution towards modes favored by preference data, potentially sacrificing variety, which is detrimental for creative generation tasks. DivPO is proposed as an online preference optimization method designed to enhance response diversity while maintaining generation quality.

Methodology: Diverse Preference Optimization (DivPO)

Standard preference optimization methods, such as DPO, learn a policy $\pi_\theta$ that aligns with a preference dataset $\mathcal{D} = \{ (x^{(i)}, y_w^{(i)}, y_l^{(i)}) \}_{i=1}^N$, where $x$ is a prompt, $y_w$ is the preferred response, and $y_l$ is the dispreferred response. The objective is typically to maximize the log-likelihood of the preference pairs under a reward model implicitly defined by the policy and a reference policy $\pi_{\mathrm{ref}}$:

$$L_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $\sigma$ is the sigmoid function and $\beta$ is a temperature parameter.
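For concreteness, a minimal PyTorch sketch of this loss is shown below. It assumes the summed per-response log-probabilities under the policy and the reference model have already been computed; the function and argument names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed log-probs log pi(y|x) of each response.

    Each argument is a 1-D tensor with one entry per preference pair.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) / pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()                             # -E[log sigma(...)]
```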

DivPO modifies the data collection or selection process within an online optimization loop. Instead of using pre-existing static preference pairs or selecting pairs solely based on quality estimated by a reward model, DivPO introduces a diversity-aware selection mechanism. For a given prompt $x$, the process involves:

  1. Sampling a Pool: Generate a pool of $k$ candidate responses $\{y_1, ..., y_k\}$ using the current policy $\pi_\theta$.
  2. Diversity Evaluation: For each response $y_i$ in the pool, calculate a measure of its diversity $D(y_i, \{y_j\}_{j \neq i})$ relative to the other responses in the pool. This could involve metrics like pairwise embedding dissimilarity (e.g., 1 minus the cosine similarity of sentence embeddings) or lexical measures.
  3. Quality Evaluation: Estimate the quality of each response $y_i$, typically using a reward model $R(x, y_i)$.
  4. Preference Pair Selection: Construct preference pairs $(x, y_w, y_l)$ based on both quality and diversity. The core idea is to select $y_w$ to be high quality and relatively rare (high diversity score) within the pool, while $y_l$ is a more common response (low diversity score) with lower quality than $y_w$. The exact selection mechanism might involve ranking candidates by a combined score of reward and diversity, or specific heuristics outlined in the paper (e.g., choosing $y_w$ from high-reward, high-diversity candidates and $y_l$ from low-reward, low-diversity candidates); a sketch of one such rule follows this list.
  5. Optimization: Update the policy $\pi_\theta$ using the selected preference pairs, employing an objective similar to the DPO loss, potentially incorporating diversity considerations more directly if the reward signal itself is modified.
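One plausible instantiation of the selection step (step 4) is a simple threshold rule: among candidates whose reward clears a quality cutoff, pick the most diverse as $y_w$; among those below the cutoff, pick the least diverse as $y_l$. The sketch below assumes precomputed reward and diversity scores; the threshold rule and all names are illustrative and may differ from the paper's exact criterion.

```python
import numpy as np

def select_divpo_pair(responses, rewards, diversity, reward_threshold):
    """Pick a (chosen, rejected) pair from a pool of k candidates.

    responses:        list of k sampled responses for one prompt
    rewards:          k reward-model scores R(x, y_i)
    diversity:        k diversity scores D(y_i, {y_j}_{j != i})
    reward_threshold: quality cutoff separating "high" from "low" quality
    Returns (y_w, y_l), or None if the pool cannot form a valid pair.
    """
    rewards = np.asarray(rewards)
    diversity = np.asarray(diversity)
    high_q = np.flatnonzero(rewards >= reward_threshold)
    low_q = np.flatnonzero(rewards < reward_threshold)
    if high_q.size == 0 or low_q.size == 0:
        return None  # skip this prompt: no pair satisfies the criteria
    chosen = high_q[np.argmax(diversity[high_q])]   # rare but high quality
    rejected = low_q[np.argmin(diversity[low_q])]   # common and low quality
    return responses[chosen], responses[rejected]
```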

This online process iteratively refines the policy, encouraging it to explore and favor diverse, high-quality outputs over common, lower-quality ones.

Implementation and Evaluation Details

The practical implementation of DivPO requires defining the diversity metric $D$. Common choices include:

  • Embedding-based: Calculate sentence embeddings (e.g., using Sentence-BERT) for all responses in the pool $\{y_1, ..., y_k\}$. For a response $y_i$, $D(y_i, \{y_j\}_{j \neq i})$ could be the average cosine distance to the other responses, $\frac{1}{k-1} \sum_{j \neq i} \bigl(1 - \cos(\mathrm{emb}(y_i), \mathrm{emb}(y_j))\bigr)$.
  • Lexical-based: Compute metrics like pairwise BLEU or ROUGE scores between responses. Diversity could be inversely related to the average overlap, e.g., $D(y_i, \{y_j\}_{j \neq i}) = 1 - \frac{1}{k-1} \sum_{j \neq i} \mathrm{BLEU}(y_i, y_j)$. A sketch of both variants follows this list.
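The following sketch implements both variants; the sentence-transformers encoder name and the use of sacrebleu are assumptions for illustration, not choices specified in the paper.

```python
import numpy as np
import sacrebleu                                        # assumed lexical-overlap backend
from sentence_transformers import SentenceTransformer   # assumed embedding backend

_encoder = SentenceTransformer("all-MiniLM-L6-v2")      # any sentence encoder works here

def embedding_diversity(responses):
    """Per-response diversity: mean cosine distance to the rest of the pool."""
    emb = _encoder.encode(responses)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                   # pairwise cosine similarities
    k = len(responses)
    return 1.0 - (sim.sum(axis=1) - 1.0) / (k - 1)      # drop the self-similarity term

def lexical_diversity(responses):
    """Per-response diversity: 1 - mean pairwise sentence-BLEU (rescaled to [0, 1])."""
    k = len(responses)
    scores = np.zeros(k)
    for i in range(k):
        others = [responses[j] for j in range(k) if j != i]
        mean_bleu = np.mean([sacrebleu.sentence_bleu(responses[i], [o]).score
                             for o in others]) / 100.0
        scores[i] = 1.0 - mean_bleu
    return scores
```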

The online nature involves repeatedly sampling prompts, generating response pools, selecting preference pairs using the DivPO criteria, and updating the model. This contrasts with standard DPO, which often operates on a fixed dataset.
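A skeleton of this loop might look like the following; it composes the hypothetical select_divpo_pair and diversity helpers from the sketches above and leaves sampling, reward scoring, and the gradient step as caller-supplied callables, since those depend on the model and training stack.

```python
import random

def divpo_online_loop(prompts, sample_fn, reward_fn, diversity_fn, pair_fn,
                      update_fn, k=8, reward_threshold=0.5, num_iters=1000):
    """One possible shape of the online DivPO loop.

    sample_fn(x, k)           -> k responses from the current policy for prompt x
    reward_fn(x, y)           -> scalar quality score R(x, y)
    diversity_fn(pool)        -> per-response diversity scores for the pool
    pair_fn(pool, r, d, thr)  -> (y_w, y_l) or None, e.g. select_divpo_pair above
    update_fn(x, y_w, y_l)    -> one DPO-style gradient step on the pair
    """
    for _ in range(num_iters):
        x = random.choice(prompts)
        pool = sample_fn(x, k)
        rewards = [reward_fn(x, y) for y in pool]
        div = diversity_fn(pool)
        pair = pair_fn(pool, rewards, div, reward_threshold)
        if pair is None:
            continue              # no valid (chosen, rejected) pair for this prompt
        y_w, y_l = pair
        update_fn(x, y_w, y_l)
```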

The evaluation focuses on quantifying both diversity and quality.

  • Diversity Metrics: Self-BLEU, distinct n-grams (distinct-1, distinct-2), or the variance/spread of response embeddings are commonly used to measure the diversity of generations across multiple outputs for the same or different prompts (a minimal sketch follows this list).
  • Quality Metrics: Win rate against baseline models (e.g., standard DPO or SFT models) using human evaluation or a held-out reward model is crucial to ensure diversity gains do not come at the cost of quality degradation.
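As a reference point, distinct-n and Self-BLEU can be computed as in the sketch below; whitespace tokenization and sacrebleu are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
import sacrebleu

def distinct_n(responses, n=2):
    """Fraction of unique n-grams over all n-grams in a set of generations (higher = more diverse)."""
    ngrams = []
    for r in responses:
        tokens = r.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def self_bleu(responses):
    """Mean BLEU of each generation against all others (lower = more diverse)."""
    scores = []
    for i, hyp in enumerate(responses):
        refs = [responses[j] for j in range(len(responses)) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)
    return float(np.mean(scores))
```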

The paper reports significant improvements in diversity metrics on tasks like persona attribute generation (+45.6%) and story generation (+74.6%), while maintaining win rates comparable to standard baselines, suggesting that DivPO successfully balances diversity and quality.

Practical Implications and Applications

DivPO offers a practical method for fine-tuning LLMs when diverse outputs are desired, particularly in:

  • Creative Writing: Generating varied stories, poems, or dialogue.
  • Brainstorming & Ideation: Producing a wider range of suggestions or ideas.
  • Synthetic Data Generation: Creating more diverse training examples, which can improve downstream model robustness and generalization.
  • Personalized Systems: Offering users more varied recommendations or responses.

Compared to techniques like adjusting sampling temperature or nucleus sampling (which control diversity at inference time) or diverse beam search (which modifies the decoding algorithm), DivPO integrates diversity directly into the model's learned preferences during training/fine-tuning. This potentially leads to a model that intrinsically generates diverse outputs rather than relying solely on decoding strategies.

Potential challenges include the computational overhead of generating response pools and calculating diversity scores in the online loop, the sensitivity of results to the choice of diversity metric $D$, and the need for careful tuning of the selection mechanism to balance diversity and quality effectively. The selection criteria for $y_w$ (high quality, high diversity) and $y_l$ (low quality, low diversity) are crucial and might require task-specific adaptation.

Conclusion

Diverse Preference Optimization (DivPO) presents a modification to standard preference optimization frameworks by incorporating an explicit diversity signal into the selection of preference pairs during online fine-tuning. By favoring high-quality responses that are also dissimilar to other candidates and disfavoring common, lower-quality responses, DivPO demonstrates substantial gains in output diversity on creative tasks while preserving generation quality comparable to standard methods like DPO. This makes it a promising technique for applications where response variety is a key requirement.