
Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives (2405.17956v2)

Published 28 May 2024 in cs.AI

Abstract: For aligning LLMs, prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to tune LLMs to easily maximize non-differentiable and non-binary objectives according to the LLM designer's preferences (e.g., using simpler language or minimizing specific kinds of harmful content). These may neither align with user preferences nor even be able to be captured tractably by binary preference data. To leverage the simplicity and performance of DPO with the generalizability of RL, we propose a hybrid approach between DPO and RLHF. With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards using offline RL. The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives, while preserving alignment performance across a range of challenging benchmarks and model sizes.
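
To make the abstract's description more concrete, below is a minimal, hypothetical sketch of how a standard DPO preference loss could be combined with a reward-weighted auxiliary term over offline samples. This is not the paper's exact HPO objective; the function names (`hybrid_loss`), the AWR-style softmax weighting, and parameters such as `aux_weight` and `aux_rewards` are illustrative assumptions.

```python
# Hypothetical sketch only: a DPO preference loss augmented with a simple
# offline-RL surrogate (reward-weighted log-likelihood). The paper's actual
# HPO augmentation of the implicit reward decomposition may differ.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on per-sequence log-probabilities."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


def hybrid_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                aux_logps, aux_rewards, beta=0.1, aux_weight=0.5):
    """Preference term plus a reward-weighted term over offline completions.

    aux_logps:   policy log-probabilities of offline completions (1-D tensor)
    aux_rewards: auxiliary designer rewards for those completions (1-D tensor)
    aux_weight:  trade-off between preference alignment and auxiliary objectives
    """
    pref = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta)
    # Weight each offline completion's log-likelihood by its (softmax-normalized)
    # auxiliary reward, so higher-reward samples contribute more to the gradient.
    weights = torch.softmax(aux_rewards, dim=0)
    aux = -(weights * aux_logps).sum()
    return pref + aux_weight * aux
```

The reward-weighted term here is just one simple offline-RL surrogate for maximizing non-differentiable auxiliary rewards; the `aux_weight` coefficient controls how strongly designer objectives are traded off against preference alignment.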

Authors (3)
  1. Anirudhan Badrinath (6 papers)
  2. Prabhat Agarwal (9 papers)
  3. Jiajing Xu (11 papers)
Citations (1)