D2PO: Discriminator-Guided DPO with Response Evaluation Models (2405.01511v2)
Abstract: Varied approaches for aligning LLMs have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, it remains an open question whether there are practical advantages to using a discriminator, such as a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are collected throughout learning. As we collect gold preferences, we use them not only to train our policy but also to train a discriminative response evaluation model, which in turn silver-labels additional synthetic data for policy training. We explore this approach across a diverse set of tasks, including a realistic chat setting, and find that it yields higher-quality outputs than DPO under the same data budget, along with greater efficiency in terms of preference data requirements. Furthermore, we identify the conditions under which silver labeling is most helpful: it is most effective when the policy is trained with DPO, where it outperforms traditional PPO, and it benefits from maintaining a discriminator separate from the policy model.
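The abstract gives only a high-level description of the training loop, but its structure (collect gold preferences online, use them both for DPO updates and to fit a discriminator that silver-labels additional on-policy pairs) can be sketched. The Python sketch below is an illustration under assumptions rather than the paper's implementation: `sample_pair`, `gold_label`, `discriminator_rank`, `fit_discriminator`, and `dpo_step` are hypothetical callables supplied by the caller, and only `dpo_loss` follows the standard published DPO objective.

```python
# Rough sketch of one online D2PO round as described in the abstract.
# All interface names below are hypothetical placeholders, not the authors' API.
import torch.nn.functional as F


def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss, given summed log-probs of the chosen (w) and rejected (l)
    responses under the current policy (pi) and the frozen reference model (ref)."""
    margins = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -F.logsigmoid(margins).mean()


def d2po_round(prompts, sample_pair, gold_label, discriminator_rank,
               fit_discriminator, dpo_step, gold_budget, silver_budget):
    """One online round: gold-label a small batch of on-policy response pairs,
    refresh the discriminator on the gold data, silver-label extra pairs with
    the discriminator, then run DPO on the combined preference set."""
    gold_pairs, silver_pairs = [], []
    for i, prompt in enumerate(prompts[: gold_budget + silver_budget]):
        resp_a, resp_b = sample_pair(prompt)  # two responses from the current policy
        if i < gold_budget:
            chosen, rejected = gold_label(prompt, resp_a, resp_b)          # gold (human) preference
            gold_pairs.append((prompt, chosen, rejected))
        else:
            chosen, rejected = discriminator_rank(prompt, resp_a, resp_b)  # silver label
            silver_pairs.append((prompt, chosen, rejected))
    fit_discriminator(gold_pairs)             # keep the response evaluator current with gold data
    dpo_step(gold_pairs + silver_pairs, loss_fn=dpo_loss)
    return gold_pairs, silver_pairs
```

Note that keeping `fit_discriminator` and `dpo_step` as separate components mirrors the abstract's finding that a discriminator maintained apart from the policy model is beneficial; the split of the budget between gold and silver labels is an assumed knob, not a value reported in the paper.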