Provably Robust DPO: Aligning Language Models with Noisy Feedback (2403.00409v2)

Published 1 Mar 2024 in cs.LG and cs.CL

Abstract: Learning from preference-based feedback has recently gained traction as a promising approach to align LLMs with human interests. While these aligned generative models have demonstrated impressive capabilities across various tasks, their dependence on high-quality human preference data poses a bottleneck in practical applications. Specifically, noisy (incorrect and ambiguous) preference pairs in the dataset might restrict the LLMs from capturing human intent accurately. While practitioners have recently proposed heuristics to mitigate the effect of noisy preferences, a complete theoretical understanding of their workings remains elusive. In this work, we aim to bridge this gap by introducing a general framework for policy optimization in the presence of random preference flips. We focus on the direct preference optimization (DPO) algorithm in particular, since it assumes that preferences adhere to the Bradley-Terry-Luce (BTL) model, raising concerns about the impact of noisy data on the learned policy. We design a novel loss function that de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise. Under log-linear parameterization of the policy class and assuming good feature coverage of the SFT policy, we prove that the sub-optimality gap of the proposed robust DPO (rDPO) policy compared to the optimal policy is of the order $O(\frac{1}{1-2\epsilon}\sqrt{\frac{d}{n}})$, where $\epsilon < 1/2$ is the label flip rate, $d$ is the policy parameter dimension, and $n$ is the dataset size. Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
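
The key algorithmic ingredient described in the abstract is a de-biased loss: given a known (or estimated) flip rate $\epsilon < 1/2$, the loss on the observed preference is combined with the loss on the flipped preference so that the correction cancels the noise in expectation, recovering the clean-label DPO objective on average. Below is a minimal PyTorch-style sketch of that idea; the function names, the $\beta$ scale, and the exact form of the correction are illustrative assumptions reconstructed from the abstract (the standard unbiased-estimator construction for class-conditional label noise), not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Vanilla DPO loss for one preference pair.

    logratio_* = log pi_theta(y|x) - log pi_ref(y|x) for the chosen / rejected response.
    """
    return -F.logsigmoid(beta * (logratio_chosen - logratio_rejected))

def rdpo_loss(logratio_chosen, logratio_rejected, eps=0.1, beta=0.1):
    """Noise-robust (de-biased) DPO loss, assuming labels are flipped i.i.d. with rate eps < 1/2.

    Taking the expectation over random flips, this eps-weighted combination
    reduces to the clean-label DPO loss, which is what makes it unbiased.
    """
    loss_as_observed = dpo_loss(logratio_chosen, logratio_rejected, beta)
    loss_if_flipped = dpo_loss(logratio_rejected, logratio_chosen, beta)
    return ((1 - eps) * loss_as_observed - eps * loss_if_flipped) / (1 - 2 * eps)

# Toy usage with made-up log-ratios for a single (chosen, rejected) pair.
chosen, rejected = torch.tensor(0.8), torch.tensor(-0.3)
print(dpo_loss(chosen, rejected).item(), rdpo_loss(chosen, rejected, eps=0.2).item())
```

Note that the de-biased loss can be negative on individual examples because the flipped-label term is subtracted, but its expectation over the label noise matches the clean objective; as $\epsilon \to 1/2$ the $1/(1-2\epsilon)$ factor blows up, mirroring the $1/(1-2\epsilon)$ dependence in the stated sub-optimality bound.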


