KTO: Model Alignment as Prospect Theoretic Optimization (2402.01306v4)

Published 2 Feb 2024 in cs.LG and cs.AI

Abstract: Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.


Summary

  • The paper introduces KTO, leveraging binary feedback and Kahneman-Tversky prospect theory to directly optimize human utility in LLM alignment.
  • It empirically demonstrates that KTO outperforms methods like DPO across model sizes and remains robust even with imbalanced or reduced data.
  • The method simplifies the alignment pipeline by bypassing SFT and effectively handling noisy, intransitive preferences, leading to benchmark improvements such as a 13.5-point rise on GSM8K.

This paper introduces Kahneman-Tversky Optimization (KTO), a novel method for aligning LLMs with human feedback that utilizes a simpler binary signal of "desirable" or "undesirable" outputs, rather than more complex and expensive-to-collect preference pairs (e.g., output A is better than output B). The core idea is to frame model alignment through the lens of Kahneman & Tversky's prospect theory, which describes how humans perceive value and make decisions under uncertainty, incorporating cognitive biases like loss aversion.

The authors first define a class of loss functions called Human-Aware Loss Functions (HALOs). These functions implicitly model human cognitive biases. The paper argues that existing successful alignment methods like Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO-Clip, used in RLHF) can be categorized as HALOs. They demonstrate empirically that HALOs generally outperform non-HALOs.

The key contribution is KTO, derived by adapting the Kahneman-Tversky model of human utility for LLM alignment. Instead of maximizing the log-likelihood of preferences like DPO, KTO directly aims to maximize the utility of generated outputs.

The KTO loss function is defined as follows (a minimal code sketch of this loss appears after the list of definitions below):

$L_{KTO}(\pi_\theta, \pi_{ref}) = \mathbb{E}_{x,y \sim D}[w(y)(1 - U_{KTO}(x, y; \beta))]$

Where:

  • $x$ is the input, $y$ is the output, and $D$ is the dataset of input-output pairs labeled as desirable or undesirable.
  • $\pi_\theta$ is the policy model being optimized.
  • $\pi_{ref}$ is the reference model (usually an SFT model).
  • $r_{KTO}(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ is the implicit reward, similar to DPO, measuring the log-probability ratio scaled by $\beta$.
  • $Z_{ref} = \mathbb{E}_{x' \sim D, y' \sim \pi^*}[r^*(x', y')]$ is a reference point, estimated as the expected reward under the optimal policy. In practice, this is simplified to the KL divergence between the optimal policy $\pi^*$ and $\pi_{ref}$, scaled by $\beta$.
  • $U_{KTO}(x, y; \beta)$ is the utility function:
    • $\sigma(r_{KTO}(x, y) - Z_{ref})$ if $y$ is desirable for $x$.
    • $\sigma(Z_{ref} - r_{KTO}(x, y))$ if $y$ is undesirable for $x$.
    • Here, $\sigma$ is the logistic function, used as an approximation of the Kahneman-Tversky value function due to its desirable properties (concave in gains, convex in losses).
  • $w(y)$ is a per-example weight:
    • $\lambda_D$ if $y$ is desirable for $x$.
    • $\lambda_U$ if $y$ is undesirable for $x$.
    • These weights handle data imbalances. The paper suggests setting them such that $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1, 4/3]$, where $n_D$ and $n_U$ are the counts of desirable and undesirable examples. Typically, one weight is set to 1 and the other is adjusted; for example, with a 1:1 ratio, $\lambda_U = 1$ and $\lambda_D \in [1, 1.33]$.
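
To make these definitions concrete, here is a minimal PyTorch sketch of the loss. It assumes the caller has already computed per-sequence log-probabilities under the policy and the frozen reference model, including for prompts paired with unrelated completions (used to estimate $Z_{ref}$). Function and argument names are illustrative; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             policy_kl_logps, ref_kl_logps,
             beta=0.1, lambda_D=1.0, lambda_U=1.0):
    """Sketch of the KTO objective for one microbatch.

    policy_logps / ref_logps: per-sequence log pi(y|x), summed over tokens,
        under the policy and the frozen reference model.
    policy_kl_logps / ref_kl_logps: the same quantities, but for prompts
        paired with *unrelated* completions from the batch (for Z_ref).
    is_desirable: boolean tensor, True where y is labeled desirable.
    """
    # Implicit reward r_KTO(x, y) = beta * log [pi_theta(y|x) / pi_ref(y|x)].
    rewards = beta * (policy_logps - ref_logps)

    # Reference point Z_ref: batch-level estimate from mismatched pairs,
    # clamped at zero; no gradient flows through it (hence .detach()).
    z_ref = (beta * (policy_kl_logps - ref_kl_logps).mean()).clamp(min=0).detach()

    # Utility: sigma(r - Z_ref) for desirable y, sigma(Z_ref - r) otherwise.
    utility = torch.where(is_desirable,
                          torch.sigmoid(rewards - z_ref),
                          torch.sigmoid(z_ref - rewards))

    # Per-example weights lambda_D / lambda_U (for imbalance handling).
    weights = torch.where(is_desirable,
                          torch.full_like(utility, lambda_D),
                          torch.full_like(utility, lambda_U))

    # L_KTO = E_{x,y ~ D} [ w(y) * (1 - U_KTO(x, y; beta)) ].
    return (weights * (1.0 - utility)).mean()
```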

Implementation Details:

  • KL Term ($Z_{ref}$): The KL term is estimated by matching inputs $x'$ with unrelated outputs $y'$ in the same batch, averaging $\beta \log \frac{\pi_\theta(y'|x')}{\pi_{ref}(y'|x')}$ over the batch, and clamping the result at zero. Crucially, gradients are not backpropagated through this KL estimate, which improves training stability. This term primarily serves to control the saturation of the loss.
  • Hyperparameter $\beta$: This controls the deviation from the reference model $\pi_{ref}$. A value of $\beta = 0.1$ is found to work well, similar to DPO.
  • Data Requirements: KTO only needs a binary signal for each output (desirable/undesirable).
    • If data is already binary (e.g., thumbs up/down), it can be used directly.
    • Preference data $(x, y_w, y_l)$ can be converted by treating $y_w$ as desirable and $y_l$ as undesirable, so a dataset of $n$ DPO pairs becomes $2n$ examples for KTO (see the conversion sketch after this list).
    • Score-based data can be converted by thresholding scores (above mean/median/fixed threshold is desirable) or by sampling desirability based on the score.
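
A minimal sketch of these two conversions. The dict field names (`prompt`, `chosen`, `rejected`, `completion`, `score`) are illustrative assumptions, not a schema prescribed by the paper:

```python
def pairs_to_kto(preference_pairs):
    """Split n preference pairs into 2n binary-labeled KTO examples."""
    examples = []
    for pair in preference_pairs:
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["chosen"],
                         "desirable": True})
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["rejected"],
                         "desirable": False})
    return examples


def scores_to_kto(scored_outputs, threshold):
    """Label score-based data by thresholding: at/above threshold is desirable."""
    return [{"prompt": ex["prompt"],
             "completion": ex["completion"],
             "desirable": ex["score"] >= threshold}
            for ex in scored_outputs]
```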

Key Experimental Findings:

  1. Performance: KTO matches or exceeds DPO performance across model scales from 1B to 30B parameters, even when KTO is trained on data derived by splitting DPO preference pairs (effectively learning from a "weaker" signal per original preference).
  2. Skipping SFT: KTO can be applied directly to a pretrained model without an SFT stage and still achieve strong performance. In contrast, DPO performance degrades significantly without SFT, often leading to rambling or hallucinated conversations.
  3. Data Imbalance: KTO is robust to extreme data imbalances. It matched DPO performance even when using up to 90% fewer desirable examples (e.g., a 1:10 ratio of desirable to undesirable examples) by adjusting $\lambda_D$ and $\lambda_U$.
  4. Naturally Unpaired Data: Experiments where only one output per input was used (reducing data by 72% on OpenAssistant for Mistral-7B) showed KTO still outperforming DPO, suggesting KTO's effectiveness is not solely reliant on its data being sourced from preference pairs.
  5. Benchmark Improvements: Replacing DPO with KTO in the Zephyr-β (a Mistral-7B derivative) training pipeline improved performance on benchmarks like MMLU, GSM8K, HumanEval, and BigBench-Hard. Notably, GSM8K performance increased by 13.5 points.

Theoretical Advantages of KTO:

  • Robustness to Noisy Data (Proposition 4.1): KTO's gradient diminishes for examples that are too "difficult" (e.g., an undesirable example with a very high model-assigned reward, or a desirable one with a very low reward). This can help ignore noisy or unlearnable feedback, though it also carries a risk of underfitting if truly hard but valid examples are ignored. A short derivation of this saturation effect follows this list.
  • Direct Utility Optimization (Theorem 4.2): DPO maximizes preference likelihood. The paper shows that multiple reward functions can lead to the same preference likelihood but different underlying human utility. KTO, by contrast, aims to directly optimize a model of human utility.
  • Handling Intransitive Preferences (Theorem 4.3): In datasets with contradictory preferences (e.g., from multiple annotators), DPO can, in the worst case, learn a policy that decreases utility for all involved. KTO, with default settings, tends not to change the policy in such intransitive scenarios, offering better worst-case guarantees.
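
A back-of-the-envelope way to see the saturation behind Proposition 4.1, using the notation above (this is a sketch, not the paper's formal statement):

```latex
% Per-example loss for a desirable y, with a := r_KTO(x, y) - Z_ref:
%   lambda_D * (1 - sigma(a))
% Differentiating with respect to the policy parameters theta:
\nabla_\theta \, \lambda_D \bigl(1 - \sigma(a)\bigr)
  = -\,\lambda_D \, \sigma(a)\bigl(1 - \sigma(a)\bigr) \, \nabla_\theta a .
% Since sigma(a)(1 - sigma(a)) -> 0 as |a| -> infinity, the update vanishes
% both when the model already handles the example well (a >> 0) and when the
% example looks "too hard" (a << 0); the latter is what screens out noisy
% feedback but can also cause underfitting on genuinely hard examples.
```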

When to Use KTO vs. DPO:

  • Binary Feedback: KTO is the natural choice if feedback is inherently binary (e.g., thumbs up/down, pass/fail) or if there's a significant imbalance between desirable/undesirable examples.
  • Preference Data:
    • If preference data is noisy or contains many intransitive preferences (common in public datasets like SHP, OpenAssistant, or even synthetic AI feedback like UltraFeedback), KTO may outperform DPO due to its robustness.
    • If preference data is high-quality, with low noise and few intransitivities, DPO might be better as KTO risks underfitting.

Practical Implications:

KTO offers a more data-efficient and potentially more robust alignment method. Its ability to use simple binary feedback significantly lowers the barrier to collecting alignment data, as such signals are far more abundant and cheaper to obtain than full preference rankings. This could allow for faster iteration cycles and alignment on a wider variety of tasks where preference data is impractical to gather (e.g., using API calls to a toxicity detector to label outputs as desirable/undesirable). The finding that KTO can skip SFT in some cases also simplifies the alignment pipeline.
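
As an illustration of that last point, here is a hedged sketch of auto-labeling model outputs with an off-the-shelf toxicity classifier via Hugging Face's `transformers` pipeline. The checkpoint name, label string, and threshold are assumptions made for illustration, not something prescribed by the paper:

```python
from transformers import pipeline

# Assumed checkpoint; any toxicity classifier or moderation API could stand in here.
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def label_for_kto(samples, max_toxicity=0.5):
    """Attach a binary desirability label to each (prompt, completion) pair."""
    labeled = []
    for prompt, completion in samples:
        result = toxicity(completion)[0]  # e.g. {"label": "toxic", "score": 0.97}
        is_toxic = (result["label"].lower() == "toxic"
                    and result["score"] >= max_toxicity)
        labeled.append({"prompt": prompt,
                        "completion": completion,
                        "desirable": not is_toxic})
    return labeled
```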

The paper concludes that KTO is a promising HALO, but not necessarily the definitive one for all scenarios, opening avenues for future research into other value functions and applications of KTO with synthetic binary feedback.
