KTO: Model Alignment as Prospect Theoretic Optimization (2402.01306v4)
Abstract: Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to their belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
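The abstract describes KTO as a human-aware loss (HALO) that maximizes a Kahneman-Tversky-style utility of generations using only a binary desirable/undesirable signal, rather than the log-likelihood of preference pairs. Below is a minimal PyTorch-style sketch of what such a loss can look like; it assumes the logistic (sigmoid) value function and reference-point formulation described in the paper, but the function name, the batch-mean proxy for the reference point, and the hyperparameter values (`beta`, `lambda_d`, `lambda_u`) are illustrative rather than the authors' exact implementation.

```python
import torch


def kto_style_loss(policy_logps, ref_logps, is_desirable,
                   beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of a binary-feedback, prospect-theory-inspired alignment loss.

    policy_logps, ref_logps: summed log-probabilities of each completion
        under the trainable policy and the frozen reference model, shape (B,).
    is_desirable: bool tensor of shape (B,), True where the completion was
        labeled desirable by the annotator.
    beta, lambda_d, lambda_u: illustrative hyperparameters (risk-aversion
        strength and the weights on desirable / undesirable examples).
    """
    # Implied reward: policy-vs-reference log-ratio, as in DPO-style HALOs.
    rewards = policy_logps - ref_logps

    # Reference point: here, a simple detached batch-level proxy for the
    # policy/reference KL term used in the paper (simplified for illustration).
    z_ref = rewards.mean().detach().clamp(min=0)

    # Kahneman-Tversky-style value: gains and losses are measured relative
    # to the reference point and squashed by a sigmoid, with separate
    # weights for desirable and undesirable outputs.
    value_good = lambda_d * torch.sigmoid(beta * (rewards - z_ref))
    value_bad = lambda_u * torch.sigmoid(beta * (z_ref - rewards))
    values = torch.where(is_desirable, value_good, value_bad)

    # Maximizing the utility of generations = minimizing (weight - value).
    weights = torch.where(is_desirable,
                          lambda_d * torch.ones_like(values),
                          lambda_u * torch.ones_like(values))
    return (weights - values).mean()


# Toy usage with dummy log-probabilities (illustrative values only).
policy_logps = torch.tensor([-12.3, -45.1, -8.7], requires_grad=True)
ref_logps = torch.tensor([-13.0, -40.2, -9.5])
is_desirable = torch.tensor([True, False, True])
loss = kto_style_loss(policy_logps, ref_logps, is_desirable)
loss.backward()  # gradients flow only through policy_logps
```

The asymmetric weights `lambda_d` and `lambda_u` mirror the paper's mechanism for trading off desirable against undesirable examples (e.g., when the two classes are imbalanced), and the detach on the reference point keeps it acting as a baseline rather than a second optimization target.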
- Improving language models with advantage-based offline policy gradients. arXiv preprint arXiv:2305.14718, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Preference-based reinforcement learning: Evolutionary direct policy search using a preference-based racing algorithm. Machine Learning, 97:327–351, 2014.
- Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
- Human irrationality: both bad and good for reward inference. arXiv preprint arXiv:2111.06956, 2021.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- UltraFeedback: Boosting language models with high-quality feedback, 2023.
- Understanding dataset difficulty with $\mathcal{V}$-usable information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 5988–6008. PMLR, 2022.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
- Aligning language models with preferences through f-divergence minimization. arXiv preprint arXiv:2302.08215, 2023.
- Decision-making under uncertainty–a field study of cumulative prospect theory. Journal of Banking & Finance, 33(7):1221–1229, 2009.
- Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182, 2017.
- Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
- Constructing stable preferences: A look into dimensions of experience and their impact on preference stability. Journal of Consumer Psychology, 8(2):113–139, 1999.
- A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pp. 65–70, 1979.
- Learning trajectory preferences for manipulators via iterative improvement. Advances in neural information processing systems, 26, 2013.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Prospect theory: An analysis of decision under risk. Econometrica, 47(2):263–292, 1979.
- OpenAssistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023.
- Pretraining language models with human preferences. In International Conference on Machine Learning, pp. 17506–17533. PMLR, 2023.
- Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
- Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1777–1788, 2018.
- When humans aren’t optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 43–52, 2020.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning, pp. 745–750, 2007.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Interpretable modelling of driving behaviors in interactive driving scenarios based on cumulative prospect theory. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 4329–4335. IEEE, 2019.
- Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Zephyr: Direct distillation of LM alignment, 2023.
- Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297–323, 1992.
- TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl, 2020.
- Neural text generation with unlikelihood training. In International Conference on Learning Representations, 2019.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.