
D2PO: Discriminator-Guided DPO with Response Evaluation Models (2405.01511v2)

Published 2 May 2024 in cs.CL

Abstract: Varied approaches for aligning LLMs have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to DPO with the same data budget, as well as greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.


Summary

  • The paper presents D2PO, a method that uses a discriminator to generate silver labels and improve training efficiency.
  • It trains the discriminator with gold-label human judgments to continuously adapt to evolving model outputs.
  • The approach reduces reliance on costly human annotations while enabling sustained, high-quality language model performance.

Harnessing Discriminators for Efficient LLM Training

Understanding the Issue with Static Preferences

LLM training often relies on static preferences – a fixed, pre-collected set of human judgments about which outputs are better. These judgments are used either to train the policy directly (as in DPO) or to train a reward model that scores responses (as in RLHF). However, a significant challenge arises when the model's output distribution shifts during training: the kinds of responses it generates change, so the static preference data no longer reflects what the model actually produces, and training becomes less effective over time.

Introducing Discriminator-guided Direct Preference Optimization (D2PO)

To address the inefficiencies of static preferences, the paper introduces Discriminator-guided Direct Preference Optimization (D2PO). This method continuously updates the notion of good and bad responses throughout training by integrating a discriminator: rather than serving only as a fixed evaluator, the discriminator actively labels new data generated as training proceeds. The process involves the following steps (a code sketch of the loop follows the list):

  1. Collecting Gold-label Preferences: Initially and at various stages, human judgments (gold labels) are collected to guide the training.
  2. Training the Discriminator: These gold-label preferences are used to fine-tune the discriminator so that it can accurately assess the quality of responses.
  3. Silver-labeling by Discriminator: The trained discriminator then labels additional generated responses (silver labels). These are used to further train the LLM without the need for expensive human annotations.
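
To make the loop concrete, here is a minimal, hypothetical sketch of one D2PO round. The helper callables (`sample_pair`, `gold_label`, `score`) and the `gold_budget` parameter are illustrative assumptions rather than the paper's actual code; the sketch only shows how gold and silver labels are routed in the online setting.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, winning response, losing response)

def d2po_round(
    prompts: List[str],
    sample_pair: Callable[[str], Tuple[str, str]],          # policy samples two candidate responses
    gold_label: Callable[[str, str, str], Tuple[str, str]],  # human judge returns (winner, loser)
    score: Callable[[str, str], float],                      # discriminator scores a (prompt, response) pair
    gold_budget: int,                                        # how many prompts get human labels this round
) -> Tuple[List[Pair], List[Pair]]:
    """One online round: spend scarce gold labels on a few prompts and
    let the discriminator silver-label the rest."""
    gold_pairs: List[Pair] = []
    silver_pairs: List[Pair] = []

    for i, prompt in enumerate(prompts):
        y_a, y_b = sample_pair(prompt)  # on-policy candidates from the current policy
        if i < gold_budget:
            winner, loser = gold_label(prompt, y_a, y_b)
            gold_pairs.append((prompt, winner, loser))
        else:
            winner, loser = (y_a, y_b) if score(prompt, y_a) >= score(prompt, y_b) else (y_b, y_a)
            silver_pairs.append((prompt, winner, loser))

    # Downstream (not shown): gold_pairs retrain the discriminator so it tracks the
    # policy's shifting output distribution; gold_pairs + silver_pairs feed the DPO update.
    return gold_pairs, silver_pairs
```

The design choice mirrored here is that gold labels serve double duty: they supervise the policy and keep the discriminator calibrated to on-policy outputs, which is what allows the silver labels to remain useful as training progresses.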

The core hypothesis here is that even with limited human-labeled data, a well-trained discriminator can effectively bootstrap the training process by generating valuable silver-labeled data.

Key Findings and Implementation

To validate this approach, several experiments were conducted across various text generation tasks. The key findings are:

  • Improved Efficiency: Compared to methods that rely solely on static preferences or on standard online preference collection, D2PO reaches comparable or better output quality with less human-labeled data, because it makes use of both gold (human) and silver (discriminator-generated) labels.
  • High-Quality Outputs: The discriminator's ongoing training ensures that it remains effective even as the model's output distribution evolves. This leads to better overall performance in generating high-quality responses.

In terms of implementation, the method combines familiar components: on-policy sampling of candidate responses, periodic discriminator updates from newly collected gold labels, and DPO-style loss optimization on the resulting preference pairs, all organized around the interactive loop between the policy and the discriminator (the loss itself is sketched below).
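
For reference, the policy update uses the standard DPO objective on preference pairs, whether they are gold- or silver-labeled. A minimal PyTorch version is sketched below, assuming the per-response log-probabilities have already been computed; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), summed over response tokens
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x) from the frozen reference model
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    """Standard DPO loss: push the policy's log-ratio for the preferred response
    above its log-ratio for the rejected one, relative to the reference model."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```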

Practical Implications and Future Directions

The implications of this research are twofold:

  1. Reduced Reliance on Human Labels: By maximizing the utility of discriminator-generated silver labels, D2PO decreases dependence on expensive human annotations, which could make large-scale LLM training more feasible and cost-effective.
  2. Continuous Learning and Adaptation: The ability of the discriminator to adapt to the model’s shifting output distribution suggests a framework where models can continually learn and improve from ongoing interactions, reflecting a more realistic and sustainable learning environment.

As for future developments, possible directions include integrating more complex or task-specific discriminators, exploring different types of preference data, and further optimizing training efficiency. Additionally, applying this framework to more diverse language tasks or in more constrained computational settings could expand its applicability and impact.

In conclusion, D2PO highlights an exciting direction for training LLMs more effectively by leveraging the strengths of discriminators in an ongoing, interactive training setting. This approach not only promises enhancements in training efficiency but also opens up new pathways for developing more adaptable and robust LLMs.