KTO: Model Alignment as Prospect Theoretic Optimization (2402.01306v4)

Published 2 Feb 2024 in cs.LG and cs.AI

Abstract: Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.


Summary

  • The paper introduces KTO, leveraging binary feedback and Kahneman-Tversky prospect theory to directly optimize human utility in LLM alignment.
  • It empirically demonstrates that KTO outperforms methods like DPO across model sizes and remains robust even with imbalanced or reduced data.
  • The method simplifies the alignment pipeline by bypassing SFT and effectively handling noisy, intransitive preferences, leading to benchmark improvements such as a 13.5-point rise on GSM8K.

This paper introduces Kahneman-Tversky Optimization (KTO), a novel method for aligning LLMs with human feedback that utilizes a simpler binary signal of "desirable" or "undesirable" outputs, rather than more complex and expensive-to-collect preference pairs (e.g., output A is better than output B). The core idea is to frame model alignment through the lens of Kahneman & Tversky's prospect theory, which describes how humans perceive value and make decisions under uncertainty, incorporating cognitive biases like loss aversion.

The authors first define a class of loss functions called Human-Aware Loss Functions (HALOs). These functions implicitly model human cognitive biases. The paper argues that existing successful alignment methods like Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO-Clip, used in RLHF) can be categorized as HALOs. They demonstrate empirically that HALOs generally outperform non-HALOs.

The key contribution is KTO, derived by adapting the Kahneman-Tversky model of human utility for LLM alignment. Instead of maximizing the log-likelihood of preferences like DPO, KTO directly aims to maximize the utility of generated outputs.

The KTO loss function is defined as follows (a minimal code sketch of this loss appears after the list of definitions below):

$L_{KTO}(\pi_\theta, \pi_{ref}) = \mathbb{E}_{x,y \sim D}[w(y)(1 - U_{KTO}(x, y; \beta))]$

Where:

  • $x$ is the input, $y$ is the output, and $D$ is the dataset of input-output pairs labeled as desirable or undesirable.
  • $\pi_\theta$ is the policy model being optimized.
  • $\pi_{ref}$ is the reference model (usually an SFT model).
  • $r_{KTO}(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ is the implicit reward, similar to DPO, measuring the log-probability ratio scaled by $\beta$.
  • $Z_{ref} = \mathbb{E}_{x' \sim D, y' \sim \pi^*}[r^*(x', y')]$ is a reference point, estimated as the expected reward under the optimal policy. In practice, this is simplified to the KL divergence between the optimal policy $\pi^*$ and $\pi_{ref}$, scaled by $\beta$.
  • $U_{KTO}(x, y; \beta)$ is the utility function:
    • $\sigma(r_{KTO}(x, y) - Z_{ref})$ if $y$ is desirable for $x$.
    • $\sigma(Z_{ref} - r_{KTO}(x, y))$ if $y$ is undesirable for $x$.
    • Here, $\sigma$ is the logistic function, used as an approximation of the Kahneman-Tversky value function due to its desirable properties (concave in gains, convex in losses).
  • $w(y)$ is a per-example weight:
    • $\lambda_D$ if $y$ is desirable for $x$.
    • $\lambda_U$ if $y$ is undesirable for $x$.
    • These weights handle data imbalances. The paper suggests setting them such that $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1, 4/3]$, where $n_D$ and $n_U$ are the counts of desirable and undesirable examples. Typically, one weight is set to 1 and the other is adjusted; for example, with a 1:1 ratio, $\lambda_U = 1$ and $\lambda_D \in [1, 1.33]$.
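
To make these definitions concrete, here is a minimal PyTorch sketch of the loss. It assumes the caller has already computed per-sequence log-probabilities under the policy and the frozen reference model, including for prompts paired with unrelated completions (used to estimate $Z_{ref}$). Function and argument names are illustrative; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             policy_kl_logps, ref_kl_logps,
             beta=0.1, lambda_D=1.0, lambda_U=1.0):
    """Sketch of the KTO objective for one microbatch.

    policy_logps / ref_logps: per-sequence log pi(y|x), summed over tokens,
        under the policy and the frozen reference model.
    policy_kl_logps / ref_kl_logps: the same quantities, but for prompts
        paired with *unrelated* completions from the batch (for Z_ref).
    is_desirable: boolean tensor, True where y is labeled desirable.
    """
    # Implicit reward r_KTO(x, y) = beta * log [pi_theta(y|x) / pi_ref(y|x)].
    rewards = beta * (policy_logps - ref_logps)

    # Reference point Z_ref: batch-level estimate from mismatched pairs,
    # clamped at zero; no gradient flows through it (hence .detach()).
    z_ref = (beta * (policy_kl_logps - ref_kl_logps).mean()).clamp(min=0).detach()

    # Utility: sigma(r - Z_ref) for desirable y, sigma(Z_ref - r) otherwise.
    utility = torch.where(is_desirable,
                          torch.sigmoid(rewards - z_ref),
                          torch.sigmoid(z_ref - rewards))

    # Per-example weights lambda_D / lambda_U (for imbalance handling).
    weights = torch.where(is_desirable,
                          torch.full_like(utility, lambda_D),
                          torch.full_like(utility, lambda_U))

    # L_KTO = E_{x,y ~ D} [ w(y) * (1 - U_KTO(x, y; beta)) ].
    return (weights * (1.0 - utility)).mean()
```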

Implementation Details:

  • KL Term ($Z_{ref}$): The KL term is estimated by matching inputs $x'$ with unrelated outputs $y'$ in the same batch, averaging $\beta \log \frac{\pi_\theta(y'|x')}{\pi_{ref}(y'|x')}$ over the batch, and clamping the result at zero. Crucially, gradients are not backpropagated through this KL estimate, which improves training stability. This term primarily serves to control the saturation of the loss.
  • Hyperparameter $\beta$: This controls the deviation from the reference model $\pi_{ref}$. A value of $\beta = 0.1$ is found to work well, similar to DPO.
  • Data Requirements: KTO only needs a binary signal for each output (desirable/undesirable).
    • If data is already binary (e.g., thumbs up/down), it can be used directly.
    • Preference data $(x, y_w, y_l)$ can be converted by treating $y_w$ as desirable and $y_l$ as undesirable, so a dataset of $n$ DPO pairs becomes $2n$ examples for KTO (see the conversion sketch after this list).
    • Score-based data can be converted by thresholding scores (above mean/median/fixed threshold is desirable) or by sampling desirability based on the score.
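
A minimal sketch of these two conversions. The dict field names (`prompt`, `chosen`, `rejected`, `completion`, `score`) are illustrative assumptions, not a schema prescribed by the paper:

```python
def pairs_to_kto(preference_pairs):
    """Split n preference pairs into 2n binary-labeled KTO examples."""
    examples = []
    for pair in preference_pairs:
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["chosen"],
                         "desirable": True})
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["rejected"],
                         "desirable": False})
    return examples


def scores_to_kto(scored_outputs, threshold):
    """Label score-based data by thresholding: at/above threshold is desirable."""
    return [{"prompt": ex["prompt"],
             "completion": ex["completion"],
             "desirable": ex["score"] >= threshold}
            for ex in scored_outputs]
```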

Key Experimental Findings:

  1. Performance: KTO matches or exceeds DPO performance across model scales from 1B to 30B parameters, even when KTO is trained on data derived by splitting DPO preference pairs (effectively learning from a "weaker" signal per original preference).
  2. Skipping SFT: KTO can be applied directly to a pretrained model without an SFT stage and still achieve strong performance. In contrast, DPO performance degrades significantly without SFT, often leading to rambling or hallucinated conversations.
  3. Data Imbalance: KTO is robust to extreme data imbalances. It matched DPO performance even when using up to 90% fewer desirable examples (e.g., a 1:10 ratio of desirable to undesirable examples) by adjusting $\lambda_D$ and $\lambda_U$.
  4. Naturally Unpaired Data: Experiments where only one output per input was used (reducing data by 72% on OpenAssistant for Mistral-7B) showed KTO still outperforming DPO, suggesting KTO's effectiveness is not solely reliant on its data being sourced from preference pairs.
  5. Benchmark Improvements: Replacing DPO with KTO in the Zephyr-β (a Mistral-7B derivative) training pipeline improved performance on benchmarks like MMLU, GSM8K, HumanEval, and BigBench-Hard. Notably, GSM8K performance increased by 13.5 points.

Theoretical Advantages of KTO:

  • Robustness to Noisy Data (Proposition 4.1): KTO's gradient diminishes for examples that are too "difficult" (e.g., an undesirable example with a very high model-assigned reward, or a desirable one with a very low reward). This can help ignore noisy or unlearnable feedback, though it also carries a risk of underfitting if truly hard but valid examples are ignored. A short derivation of this saturation effect follows this list.
  • Direct Utility Optimization (Theorem 4.2): DPO maximizes preference likelihood. The paper shows that multiple reward functions can lead to the same preference likelihood but different underlying human utility. KTO, by contrast, aims to directly optimize a model of human utility.
  • Handling Intransitive Preferences (Theorem 4.3): In datasets with contradictory preferences (e.g., from multiple annotators), DPO can, in the worst case, learn a policy that decreases utility for all involved. KTO, with default settings, tends not to change the policy in such intransitive scenarios, offering better worst-case guarantees.
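
A back-of-the-envelope way to see the saturation behind Proposition 4.1, using the notation above (this is a sketch, not the paper's formal statement):

```latex
% Per-example loss for a desirable y, with a := r_KTO(x, y) - Z_ref:
%   lambda_D * (1 - sigma(a))
% Differentiating with respect to the policy parameters theta:
\nabla_\theta \, \lambda_D \bigl(1 - \sigma(a)\bigr)
  = -\,\lambda_D \, \sigma(a)\bigl(1 - \sigma(a)\bigr) \, \nabla_\theta a .
% Since sigma(a)(1 - sigma(a)) -> 0 as |a| -> infinity, the update vanishes
% both when the model already handles the example well (a >> 0) and when the
% example looks "too hard" (a << 0); the latter is what screens out noisy
% feedback but can also cause underfitting on genuinely hard examples.
```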

When to Use KTO vs. DPO:

  • Binary Feedback: KTO is the natural choice if feedback is inherently binary (e.g., thumbs up/down, pass/fail) or if there's a significant imbalance between desirable/undesirable examples.
  • Preference Data:
    • If preference data is noisy or contains many intransitive preferences (common in public datasets like SHP, OpenAssistant, or even synthetic AI feedback like UltraFeedback), KTO may outperform DPO due to its robustness.
    • If preference data is high-quality, with low noise and few intransitivities, DPO might be better as KTO risks underfitting.

Practical Implications:

KTO offers a more data-efficient and potentially more robust alignment method. Its ability to use simple binary feedback significantly lowers the barrier to collecting alignment data, as such signals are far more abundant and cheaper to obtain than full preference rankings. This could allow for faster iteration cycles and alignment on a wider variety of tasks where preference data is impractical to gather (e.g., using API calls to a toxicity detector to label outputs as desirable/undesirable). The finding that KTO can skip SFT in some cases also simplifies the alignment pipeline.
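
As an illustration of that last point, here is a hedged sketch of auto-labeling model outputs with an off-the-shelf toxicity classifier via Hugging Face's `transformers` pipeline. The checkpoint name, label string, and threshold are assumptions made for illustration, not something prescribed by the paper:

```python
from transformers import pipeline

# Assumed checkpoint; any toxicity classifier or moderation API could stand in here.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def label_for_kto(samples, max_toxicity=0.5):
    """Attach a binary desirability label to each (prompt, completion) pair."""
    labeled = []
    for prompt, completion in samples:
        result = toxicity(completion)[0]  # e.g. {"label": "toxic", "score": 0.97}
        is_toxic = (result["label"].lower() == "toxic"
                    and result["score"] >= max_toxicity)
        labeled.append({"prompt": prompt,
                        "completion": completion,
                        "desirable": not is_toxic})
    return labeled
```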

The paper concludes that KTO is a promising HALO, but not necessarily the definitive one for all scenarios, opening avenues for future research into other value functions and applications of KTO with synthetic binary feedback.
