Diverse Preference Optimization (2501.18101v4)

Published 30 Jan 2025 in cs.CL

Abstract: Post-training of LLMs, either through reinforcement learning, preference optimization or supervised finetuning, tends to sharpen the output probability distribution and reduce the diversity of generated responses. This is particularly a problem for creative generative tasks where varied responses are desired. In this work we introduce Diverse Preference Optimization (DivPO), an optimization method which learns to generate much more diverse responses than standard pipelines, while maintaining the quality of the generations. In DivPO, preference pairs are selected by first considering a pool of responses, and a measure of diversity among them, and selecting chosen examples as being more rare but high quality, while rejected examples are more common, but low quality. DivPO results in generating 45.6% more diverse persona attributes, and a 74.6% increase in story diversity, while maintaining similar win rates as standard baselines. On general instruction following, DivPO results in a 46.2% increase in diversity, and a 2.4% winrate improvement compared to DPO.

Summary

  • The paper introduces DivPO, an online preference optimization method that integrates a diversity-aware selection mechanism to balance output quality and variety.
  • It employs evaluation metrics like cosine similarity and BLEU to measure response diversity within candidate pools generated by the model.
  • DivPO achieves significant diversity gains (up to +74.6%) in creative tasks while maintaining quality levels comparable to standard DPO methods.

The paper "Diverse Preference Optimization" (2501.18101) addresses the common issue of reduced output diversity resulting from standard post-training techniques like Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). These methods often sharpen the model's output distribution towards modes favored by preference data, potentially sacrificing variety, which is detrimental for creative generation tasks. DivPO is proposed as an online preference optimization method designed to enhance response diversity while maintaining generation quality.

Methodology: Diverse Preference Optimization (DivPO)

Standard preference optimization methods, such as DPO, learn a policy $\pi_\theta$ that aligns with a preference dataset $\mathcal{D} = \{ (x^{(i)}, y_w^{(i)}, y_l^{(i)}) \}_{i=1}^N$, where $x$ is a prompt, $y_w$ is the preferred response, and $y_l$ is the dispreferred response. The objective is typically to maximize the log-likelihood of the preference pairs under a reward model implicitly defined by the policy and a reference policy $\pi_{\mathrm{ref}}$:

$$L_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $\sigma$ is the sigmoid function and $\beta$ is a temperature parameter.
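For concreteness, a minimal PyTorch sketch of this loss is shown below. It assumes the summed per-response log-probabilities under the policy and the reference model have already been computed; the function and argument names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed log-probs log pi(y|x) of each response.

    Each argument is a 1-D tensor with one entry per preference pair.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) / pi_ref(y_l|x)
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()                             # -E[log sigma(...)]
```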

DivPO modifies the data collection or selection process within an online optimization loop. Instead of using pre-existing static preference pairs or selecting pairs solely based on quality estimated by a reward model, DivPO introduces a diversity-aware selection mechanism. For a given prompt $x$, the process involves:

  1. Sampling a Pool: Generate a pool of $k$ candidate responses $\{y_1, ..., y_k\}$ using the current policy $\pi_\theta$.
  2. Diversity Evaluation: For each response $y_i$ in the pool, calculate a measure of its diversity $D(y_i, \{y_j\}_{j \neq i})$ relative to the other responses in the pool. This could involve metrics like pairwise embedding dissimilarity (e.g., 1 minus the cosine similarity of sentence embeddings) or lexical measures.
  3. Quality Evaluation: Estimate the quality of each response $y_i$, typically using a reward model $R(x, y_i)$.
  4. Preference Pair Selection: Construct preference pairs $(x, y_w, y_l)$ based on both quality and diversity. The core idea is to select $y_w$ to be high quality and relatively rare (high diversity score) within the pool, while $y_l$ is a more common response (low diversity score) with lower quality than $y_w$. The exact selection mechanism might involve ranking candidates by a combined score of reward and diversity, or specific heuristics outlined in the paper (e.g., choosing $y_w$ from high-reward, high-diversity candidates and $y_l$ from low-reward, low-diversity candidates); a sketch of one such rule follows this list.
  5. Optimization: Update the policy $\pi_\theta$ using the selected preference pairs, employing an objective similar to the DPO loss, potentially incorporating diversity considerations more directly if the reward signal itself is modified.
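One plausible instantiation of the selection step (step 4) is a simple threshold rule: among candidates whose reward clears a quality cutoff, pick the most diverse as $y_w$; among those below the cutoff, pick the least diverse as $y_l$. The sketch below assumes precomputed reward and diversity scores; the threshold rule and all names are illustrative and may differ from the paper's exact criterion.

```python
import numpy as np

def select_divpo_pair(responses, rewards, diversity, reward_threshold):
    """Pick a (chosen, rejected) pair from a pool of k candidates.

    responses:        list of k sampled responses for one prompt
    rewards:          k reward-model scores R(x, y_i)
    diversity:        k diversity scores D(y_i, {y_j}_{j != i})
    reward_threshold: quality cutoff separating "high" from "low" quality
    Returns (y_w, y_l), or None if the pool cannot form a valid pair.
    """
    rewards = np.asarray(rewards)
    diversity = np.asarray(diversity)
    high_q = np.flatnonzero(rewards >= reward_threshold)
    low_q = np.flatnonzero(rewards < reward_threshold)
    if high_q.size == 0 or low_q.size == 0:
        return None  # skip this prompt: no pair satisfies the criteria
    chosen = high_q[np.argmax(diversity[high_q])]   # rare but high quality
    rejected = low_q[np.argmin(diversity[low_q])]   # common and low quality
    return responses[chosen], responses[rejected]
```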

This online process iteratively refines the policy, encouraging it to explore and favor diverse, high-quality outputs over common, lower-quality ones.

Implementation and Evaluation Details

The practical implementation of DivPO requires defining the diversity metric $D$. Common choices include:

  • Embedding-based: Calculate sentence embeddings (e.g., using Sentence-BERT) for all responses in the pool $\{y_1, ..., y_k\}$. For a response $y_i$, $D(y_i, \{y_j\}_{j \neq i})$ could be the average cosine distance to the other responses, $\frac{1}{k-1} \sum_{j \neq i} \bigl(1 - \cos(\mathrm{emb}(y_i), \mathrm{emb}(y_j))\bigr)$.
  • Lexical-based: Compute metrics like pairwise BLEU or ROUGE scores between responses. Diversity could be inversely related to the average overlap, e.g., $D(y_i, \{y_j\}_{j \neq i}) = 1 - \frac{1}{k-1} \sum_{j \neq i} \mathrm{BLEU}(y_i, y_j)$. A sketch of both variants follows this list.
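The following sketch implements both variants; the sentence-transformers encoder name and the use of sacrebleu are assumptions for illustration, not choices specified in the paper.

```python
import numpy as np
import sacrebleu                                        # assumed lexical-overlap backend
from sentence_transformers import SentenceTransformer   # assumed embedding backend

_encoder = SentenceTransformer("all-MiniLM-L6-v2")      # any sentence encoder works here

def embedding_diversity(responses):
    """Per-response diversity: mean cosine distance to the rest of the pool."""
    emb = _encoder.encode(responses)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                   # pairwise cosine similarities
    k = len(responses)
    return 1.0 - (sim.sum(axis=1) - 1.0) / (k - 1)      # drop the self-similarity term

def lexical_diversity(responses):
    """Per-response diversity: 1 - mean pairwise sentence-BLEU (rescaled to [0, 1])."""
    k = len(responses)
    scores = np.zeros(k)
    for i in range(k):
        others = [responses[j] for j in range(k) if j != i]
        mean_bleu = np.mean([sacrebleu.sentence_bleu(responses[i], [o]).score
                             for o in others]) / 100.0
        scores[i] = 1.0 - mean_bleu
    return scores
```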

The online nature involves repeatedly sampling prompts, generating response pools, selecting preference pairs using the DivPO criteria, and updating the model. This contrasts with standard DPO, which often operates on a fixed dataset.
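A skeleton of this loop might look like the following; it composes the hypothetical select_divpo_pair and diversity helpers from the sketches above and leaves sampling, reward scoring, and the gradient step as caller-supplied callables, since those depend on the model and training stack.

```python
import random

def divpo_online_loop(prompts, sample_fn, reward_fn, diversity_fn, pair_fn,
                      update_fn, k=8, reward_threshold=0.5, num_iters=1000):
    """One possible shape of the online DivPO loop.

    sample_fn(x, k)           -> k responses from the current policy for prompt x
    reward_fn(x, y)           -> scalar quality score R(x, y)
    diversity_fn(pool)        -> per-response diversity scores for the pool
    pair_fn(pool, r, d, thr)  -> (y_w, y_l) or None, e.g. select_divpo_pair above
    update_fn(x, y_w, y_l)    -> one DPO-style gradient step on the pair
    """
    for _ in range(num_iters):
        x = random.choice(prompts)
        pool = sample_fn(x, k)
        rewards = [reward_fn(x, y) for y in pool]
        div = diversity_fn(pool)
        pair = pair_fn(pool, rewards, div, reward_threshold)
        if pair is None:
            continue              # no valid (chosen, rejected) pair for this prompt
        y_w, y_l = pair
        update_fn(x, y_w, y_l)
```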

The evaluation focuses on quantifying both diversity and quality.

  • Diversity Metrics: Self-BLEU, distinct n-grams (distinct-1, distinct-2), or the variance/spread of response embeddings are commonly used to measure the diversity of generations across multiple outputs for the same or different prompts (a minimal sketch follows this list).
  • Quality Metrics: Win rate against baseline models (e.g., standard DPO or SFT models) using human evaluation or a held-out reward model is crucial to ensure diversity gains do not come at the cost of quality degradation.
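As a reference point, distinct-n and Self-BLEU can be computed as in the sketch below; whitespace tokenization and sacrebleu are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
import sacrebleu

def distinct_n(responses, n=2):
    """Fraction of unique n-grams over all n-grams in a set of generations (higher = more diverse)."""
    ngrams = []
    for r in responses:
        tokens = r.split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def self_bleu(responses):
    """Mean BLEU of each generation against all others (lower = more diverse)."""
    scores = []
    for i, hyp in enumerate(responses):
        refs = [responses[j] for j in range(len(responses)) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)
    return float(np.mean(scores))
```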

The paper reports significant improvements in diversity metrics on tasks like persona attribute generation (+45.6%) and story generation (+74.6%), while maintaining win rates comparable to standard baselines, suggesting that DivPO successfully balances diversity and quality.

Practical Implications and Applications

DivPO offers a practical method for fine-tuning LLMs when diverse outputs are desired, particularly in:

  • Creative Writing: Generating varied stories, poems, or dialogue.
  • Brainstorming & Ideation: Producing a wider range of suggestions or ideas.
  • Synthetic Data Generation: Creating more diverse training examples, which can improve downstream model robustness and generalization.
  • Personalized Systems: Offering users more varied recommendations or responses.

Compared to techniques like adjusting sampling temperature or nucleus sampling (which control diversity at inference time) or diverse beam search (which modifies the decoding algorithm), DivPO integrates diversity directly into the model's learned preferences during training/fine-tuning. This potentially leads to a model that intrinsically generates diverse outputs rather than relying solely on decoding strategies.

Potential challenges include the computational overhead of generating response pools and calculating diversity scores in the online loop, the sensitivity of results to the choice of diversity metric $D$, and the need for careful tuning of the selection mechanism to balance diversity and quality effectively. The selection criteria for $y_w$ (high quality, high diversity) and $y_l$ (low quality, low diversity) are crucial and might require task-specific adaptation.

Conclusion

Diverse Preference Optimization (DivPO) presents a modification to standard preference optimization frameworks by incorporating an explicit diversity signal into the selection of preference pairs during online fine-tuning. By favoring high-quality responses that are also dissimilar to other candidates and disfavoring common, lower-quality responses, DivPO demonstrates substantial gains in output diversity on creative tasks while preserving generation quality comparable to standard methods like DPO. This makes it a promising technique for applications where response variety is a key requirement.