
Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model (2403.19443v1)

Published 28 Mar 2024 in cs.CL

Abstract: LLMs have become increasingly popular due to their ability to process and generate natural language. However, because they are trained on massive text corpora, LLMs can inherit harmful biases and produce outputs that are not aligned with human values. This paper studies two main approaches to LLM alignment: Reinforcement Learning from Human Feedback (RLHF) and contrastive learning-based methods such as Direct Preference Optimization (DPO). By analyzing the stability and robustness of RLHF and DPO, we propose MPO (Mixed Preference Optimization), a novel method that mitigates the weaknesses of both approaches. Specifically, we propose a two-stage training procedure: first train DPO on an easy dataset, then perform RLHF on a difficult set with the DPO model as the reference model. Here, the easy and difficult sets are constructed by a well-trained reward model that splits response pairs into those with a large reward gap (easy) and those with a small gap (difficult). The first stage quickly yields a relatively good policy (LLM), while the second stage refines the LLM with online RLHF, mitigating the distribution-shift issue associated with DPO. Experiments on two public alignment datasets, HH-RLHF and TLDR, demonstrate the effectiveness of MPO under both GPT-4 and human evaluation.
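The core of the method is the reward-gap-based data split followed by the two training stages. Below is a minimal sketch of that pipeline based only on the abstract; the names (PreferencePair, split_by_reward_gap, threshold, and the trainer calls in the comments) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of MPO's data-selection step (names are hypothetical).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # preferred response
    rejected: str  # dispreferred response


def split_by_reward_gap(
    pairs: List[PreferencePair],
    reward_fn: Callable[[str, str], float],  # trained reward model: (prompt, response) -> score
    threshold: float,
) -> Tuple[List[PreferencePair], List[PreferencePair]]:
    """Split pairs into an 'easy' set (large reward gap) and a 'difficult' set (small gap)."""
    easy, difficult = [], []
    for p in pairs:
        gap = reward_fn(p.prompt, p.chosen) - reward_fn(p.prompt, p.rejected)
        (easy if gap >= threshold else difficult).append(p)
    return easy, difficult


# Two-stage training outline (trainer functions are placeholders, not a real API):
# 1) policy = dpo_train(sft_model, easy)                # offline DPO on the easy pairs
# 2) policy = ppo_train(policy, difficult, ref=policy)  # online RLHF (e.g. PPO) on the
#    difficult prompts, with the stage-1 DPO model frozen as the KL reference model
```

Using the stage-1 DPO policy as the frozen reference in stage 2 anchors the KL penalty to a model that is already close to the target distribution, which is the mechanism the abstract credits for mitigating DPO's distribution-shift issue.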

Authors (2)
  1. Qi Gou (4 papers)
  2. Cam-Tu Nguyen (15 papers)
Citations (3)