
D2PO: Discriminator-Guided DPO with Response Evaluation Models (2405.01511v2)

Published 2 May 2024 in cs.CL

Abstract: Varied approaches for aligning LLMs have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to DPO with the same data budget, as well as greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.


Summary

  • The paper presents D2PO, a method that uses a discriminator to generate silver labels and improve training efficiency.
  • It trains the discriminator with gold-label human judgments to continuously adapt to evolving model outputs.
  • The approach reduces reliance on costly human annotations while enabling sustained, high-quality language model performance.

Harnessing Discriminators for Efficient LLM Training

Understanding the Issue with Static Preferences

LLM training often relies on static preferences – a fixed, pre-collected set of human judgments about which outputs are better. These judgments are used either to train the policy directly (as in DPO) or to train a reward model that scores responses (as in RLHF). However, a significant challenge arises when the model's output distribution shifts during training: the kinds of responses it generates change, so the static preference data no longer reflects what the model actually produces, and training becomes less effective over time.

Introducing Discriminator-guided Direct Preference Optimization (D2PO)

To address the inefficiencies of static preferences, the paper introduces Discriminator-guided Direct Preference Optimization (D2PO). This method continuously updates the notion of good and bad responses throughout training by integrating a discriminator: rather than serving only as a fixed evaluator, the discriminator actively labels new data generated as training proceeds. The process involves the following steps (a code sketch of the loop follows the list):

  1. Collecting Gold-label Preferences: Initially and at various stages, human judgments (gold labels) are collected to guide the training.
  2. Training the Discriminator: These gold-label preferences are used to fine-tune the discriminator so that it can accurately assess the quality of responses.
  3. Silver-labeling by Discriminator: The trained discriminator then labels additional generated responses (silver labels). These are used to further train the LLM without the need for expensive human annotations.
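
To make the loop concrete, here is a minimal, hypothetical sketch of one D2PO round. The helper callables (`sample_pair`, `gold_label`, `score`) and the `gold_budget` parameter are illustrative assumptions rather than the paper's actual code; the sketch only shows how gold and silver labels are routed in the online setting.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, winning response, losing response)

def d2po_round(
    prompts: List[str],
    sample_pair: Callable[[str], Tuple[str, str]],          # policy samples two candidate responses
    gold_label: Callable[[str, str, str], Tuple[str, str]],  # human judge returns (winner, loser)
    score: Callable[[str, str], float],                      # discriminator scores a (prompt, response) pair
    gold_budget: int,                                        # how many prompts get human labels this round
) -> Tuple[List[Pair], List[Pair]]:
    """One online round: spend scarce gold labels on a few prompts and
    let the discriminator silver-label the rest."""
    gold_pairs: List[Pair] = []
    silver_pairs: List[Pair] = []

    for i, prompt in enumerate(prompts):
        y_a, y_b = sample_pair(prompt)  # on-policy candidates from the current policy
        if i < gold_budget:
            winner, loser = gold_label(prompt, y_a, y_b)
            gold_pairs.append((prompt, winner, loser))
        else:
            winner, loser = (y_a, y_b) if score(prompt, y_a) >= score(prompt, y_b) else (y_b, y_a)
            silver_pairs.append((prompt, winner, loser))

    # Downstream (not shown): gold_pairs retrain the discriminator so it tracks the
    # policy's shifting output distribution; gold_pairs + silver_pairs feed the DPO update.
    return gold_pairs, silver_pairs
```

The design choice mirrored here is that gold labels serve double duty: they supervise the policy and keep the discriminator calibrated to on-policy outputs, which is what allows the silver labels to remain useful as training progresses.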

The core hypothesis here is that even with limited human-labeled data, a well-trained discriminator can effectively bootstrap the training process by generating valuable silver-labeled data.

Key Findings and Implementation

To validate this approach, several experiments were conducted across various text generation tasks. The key findings are:

  • Improved Efficiency: Compared to methods that rely solely on static preferences or on standard online preference collection, D2PO reaches comparable or better output quality with less human-labeled data, because it makes use of both gold (human) and silver (discriminator-generated) labels.
  • High-Quality Outputs: The discriminator's ongoing training ensures that it remains effective even as the model's output distribution evolves. This leads to better overall performance in generating high-quality responses.

In terms of implementation, the method combines familiar components: on-policy sampling of candidate responses, periodic discriminator updates from newly collected gold labels, and DPO-style loss optimization on the resulting preference pairs, all organized around the interactive loop between the policy and the discriminator (the loss itself is sketched below).
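
For reference, the policy update uses the standard DPO objective on preference pairs, whether they are gold- or silver-labeled. A minimal PyTorch version is sketched below, assuming the per-response log-probabilities have already been computed; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over response tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) from the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """Standard DPO loss: push the policy's log-ratio for the preferred response
    above its log-ratio for the rejected one, relative to the reference model."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```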

Practical Implications and Future Directions

The implications of this research are twofold:

  1. Reduced Reliance on Human Labels: By maximizing the utility of discriminator-generated silver labels, D2PO decreases dependence on expensive human annotations, which could make large-scale LLM training more feasible and cost-effective.
  2. Continuous Learning and Adaptation: The ability of the discriminator to adapt to the model’s shifting output distribution suggests a framework where models can continually learn and improve from ongoing interactions, reflecting a more realistic and sustainable learning environment.

As for future developments, possible directions include integrating more complex or task-specific discriminators, exploring different types of preference data, and further optimizing training efficiency. Additionally, applying this framework to more diverse language tasks or in more constrained computational settings could expand its applicability and impact.

In conclusion, D2PO highlights an exciting direction for training LLMs more effectively by leveraging the strengths of discriminators in an ongoing, interactive training setting. This approach not only promises enhancements in training efficiency but also opens up new pathways for developing more adaptable and robust LLMs.