
Enhancing LLM Safety via Constrained Direct Preference Optimization (2403.02475v1)

Published 4 Mar 2024 in cs.LG and cs.CL

Abstract: The rapidly increasing capabilities of LLMs raise an urgent need to align AI systems with diverse human preferences to simultaneously enhance their usefulness and safety, despite the often conflicting nature of these goals. To address this important problem, a promising approach is to enforce a safety constraint at the fine-tuning stage through a constrained Reinforcement Learning from Human Feedback (RLHF) framework. This approach, however, is computationally expensive and often unstable. In this work, we introduce Constrained DPO (C-DPO), a novel extension of the recently proposed Direct Preference Optimization (DPO) approach for fine-tuning LLMs that is both efficient and lightweight. By integrating dual gradient descent and DPO, our method identifies a nearly optimal trade-off between helpfulness and harmlessness without using reinforcement learning. Empirically, our approach provides a safety guarantee to LLMs that is missing in DPO while achieving significantly higher rewards under the same safety constraint compared to a recently proposed safe RLHF approach. Warning: This paper contains example data that may be offensive or harmful.
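
The abstract describes C-DPO as a combination of dual gradient descent with the DPO objective, trading off a helpfulness reward against a harmfulness cost under a safety constraint. The toy sketch below illustrates that general recipe only: rank each response pair by a penalized reward r − λ·c, apply a standard DPO-style loss to the induced preferences, then update the multiplier λ with a projected dual step. All names, tensor shapes, and hyperparameter values (beta, c_limit, dual_lr) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the C-DPO idea: an inner DPO-style update on preferences
# ranked by a penalized reward r - lambda * c, plus an outer dual-gradient step on
# the multiplier lambda. Names, shapes, and values are illustrative assumptions.
import torch
import torch.nn.functional as F

beta = 0.1               # DPO temperature (assumed value)
c_limit = 0.0            # safety budget: target expected cost (assumed)
dual_lr = 0.05           # step size for the dual update (assumed)
lam = torch.tensor(1.0)  # Lagrange multiplier (dual variable)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Standard DPO loss on (chosen, rejected) pairs from policy and reference log-probs."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def prefer_first(reward_a, reward_b, cost_a, cost_b, lam):
    """Rank responses by the penalized reward r - lam * c (illustrative)."""
    return (reward_a - lam * cost_a) >= (reward_b - lam * cost_b)

# Toy batch standing in for reward-model / cost-model scores and sequence log-probs.
reward_a, reward_b = torch.randn(8), torch.randn(8)
cost_a, cost_b = torch.randn(8), torch.randn(8)
logp_a = torch.randn(8, requires_grad=True)
logp_b = torch.randn(8, requires_grad=True)
ref_logp_a, ref_logp_b = torch.randn(8), torch.randn(8)

# Inner step: choose preferences under the current lambda and take a DPO gradient step.
prefer_a = prefer_first(reward_a, reward_b, cost_a, cost_b, lam)
logp_chosen = torch.where(prefer_a, logp_a, logp_b)
logp_rejected = torch.where(prefer_a, logp_b, logp_a)
ref_chosen = torch.where(prefer_a, ref_logp_a, ref_logp_b)
ref_rejected = torch.where(prefer_a, ref_logp_b, ref_logp_a)
loss = dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected)
loss.backward()  # in practice this updates the policy parameters

# Outer (dual) step: raise lambda when expected cost exceeds the budget, lower it
# otherwise, and project back onto lambda >= 0.
expected_cost = torch.where(prefer_a, cost_a, cost_b).mean()
lam = torch.clamp(lam + dual_lr * (expected_cost - c_limit), min=0.0)
```

In this reading, the dual variable adaptively tightens or relaxes the harmlessness penalty until the safety constraint is met, while the policy itself is only ever trained with a lightweight DPO-style objective rather than reinforcement learning, which is what the abstract credits for the method's efficiency.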

Authors (3)
  1. Zixuan Liu (38 papers)
  2. Xiaolin Sun (8 papers)
  3. Zizhan Zheng (33 papers)
Citations (13)