Stepwise Alignment for Constrained Language Model Policy Optimization (2404.11049v3)
Abstract: Safety and trustworthiness are indispensable requirements for real-world applications of AI systems using LLMs. This paper formulates human value alignment as an optimization problem over the LLM policy that maximizes reward under a safety constraint, and then proposes an algorithm, Stepwise Alignment for Constrained Policy Optimization (SACPO). One key idea behind SACPO, supported by theory, is that the optimal policy incorporating both reward and safety can be obtained directly from a reward-aligned policy. Building on this idea, SACPO aligns the LLM with each metric step-wise while leveraging simple yet powerful alignment algorithms such as direct preference optimization (DPO). SACPO offers several advantages, including simplicity, stability, computational efficiency, and flexibility in the choice of algorithms and datasets. Under mild assumptions, our theoretical analysis provides upper bounds on the optimality gap and the safety constraint violation. Our experimental results show that SACPO can fine-tune Alpaca-7B better than the state-of-the-art method in terms of both helpfulness and harmlessness.
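To make the stepwise idea concrete, the sketch below shows the standard DPO loss and how the two alignment passes differ only in their preference data, reference policy, and KL coefficient. This is a minimal PyTorch illustration, not the paper's implementation: the `lam` multiplier and the `beta / lam` rescaling in the second step are assumptions standing in for the paper's treatment of the safety constraint, and the tensors are dummy per-sequence log-probabilities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta):
    # Standard DPO loss: the implicit reward of a response is
    # beta * (log pi(y|x) - log pi_ref(y|x)); preferences follow a
    # Bradley-Terry model over the reward difference.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen - rejected).mean()

# Stepwise alignment, schematically:
#   Step 1: run DPO on helpfulness preferences with the SFT model as the
#           reference (KL coefficient beta) to obtain a reward-aligned policy.
#   Step 2: run DPO again on safety preferences, now using the reward-aligned
#           policy from step 1 as the reference, with a rescaled coefficient
#           (here beta / lam, where lam stands in for the safety constraint's
#           multiplier -- an assumption, not the paper's exact parameterization).
beta, lam = 0.1, 2.0

# Dummy per-sequence log-probabilities in place of real model outputs.
pi_c, pi_r = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-14.0])

step1_loss = dpo_loss(pi_c, pi_r, ref_c, ref_r, beta)        # reward step
step2_loss = dpo_loss(pi_c, pi_r, ref_c, ref_r, beta / lam)  # safety step
print(step1_loss.item(), step2_loss.item())
```

In practice, each step would be an ordinary DPO fine-tuning run; the only change in the second step is swapping in the safety preference dataset and replacing the reference model with the checkpoint produced by the first step.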
- Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs. arXiv preprint arXiv:2402.14740, 2024.
- E. Altman. Constrained Markov decision processes. Routledge, 2021.
- Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- D. P. Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic Press, 2014.
- N. Bostrom and E. Yudkowsky. The ethics of artificial intelligence. In Artificial intelligence safety and security, pages 57–69. Chapman and Hall/CRC, 2018.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
- Safe RLHF: Safe reinforcement learning from human feedback. In International Conference on Learning Representations (ICLR), 2024.
- Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3304–3312, 2021.
- Last-iterate convergent policy gradient primal-dual methods for constrained MDPs. Advances in Neural Information Processing Systems (NeurIPS), 36, 2024.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- PAL: Program-aided language models. In International Conference on Machine Learning (ICML), pages 10764–10799, 2023.
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
- Arcee's MergeKit: A toolkit for merging large language models. arXiv preprint arXiv:2403.13257, 2024.
- BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems (NeurIPS), 36, 2024.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory (COLT), pages 2137–2143. PMLR, 2020.
- Understanding the effects of RLHF on LLM generalisation and diversity. In International Conference on Learning Representations (ICLR), 2024.
- Batch policy learning under constraints. In International Conference on Machine Learning (ICML), pages 3703–3712, 2019.
- A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pages 661–670, 2010.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
- Mitigating political bias in language models through reinforced calibration. In AAAI Conference on Artificial Intelligence (AAAI), 2021.
- Trustworthy LLMs: A survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374, 2023.
- Enhancing LLM safety via constrained direct preference optimization. arXiv preprint arXiv:2403.02475, 2024.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022.
- Safe policies for reinforcement learning via primal-dual methods. IEEE Transactions on Automatic Control, 68(3):1321–1336, 2022.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019.
- Verbosity bias in preference labeling by large language models. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning (CoRL), pages 492–504. PMLR, 2023.
- A long way to go: Investigating length correlations in RLHF. arXiv preprint arXiv:2310.03716, 2023.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 33:3008–3021, 2020.
- Mitigating gender bias in natural language processing: Literature review. In Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
- DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. arXiv preprint arXiv:2306.11698, 2023.
- Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In International Conference on Learning Representations (ICLR), 2024.
- R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning (ICML), pages 23965–23998, 2022.
- Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. arXiv preprint arXiv:2312.11456, 2023.
- Wordcraft: Story writing with large language models. In International Conference on Intelligent User Interfaces (IUI), pages 841–852, 2022.
- Prompting large language model for machine translation: A case study. In International Conference on Machine Learning (ICML), pages 41092–41110, 2023.
- Panacea: Pareto alignment via preference adaptation for LLMs. arXiv preprint arXiv:2402.02030, 2024.
- Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
- M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning (ICML), pages 928–936, 2003.