
ORPO: Monolithic Preference Optimization without Reference Model (2403.07691v2)

Published 12 Mar 2024 in cs.CL and cs.AI

Abstract: While recent preference alignment algorithms for LLMs have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across model sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on UltraFeedback alone surpasses the performance of state-of-the-art LLMs with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).

Reference-free Monolithic Preference Optimization with Odds Ratio

Introduction

In the landscape of preference alignment for LLMs, the role of Supervised Fine-Tuning (SFT) has historically been paramount. However, existing methods often employ multi-phase processes which involve additional models and training stages, leading to increased complexity and resource demand. Addressing this gap, the paper introduces an innovative approach dubbed Odds Ratio Preference Optimization (ORPO). This method streamlines preference alignment by embedding it directly within the SFT phase, eliminating the need for a separate alignment stage or a reference model.

Methodology

The Crucial Role of SFT

The paper begins by dissecting the role of SFT in existing alignment pipelines, showing that while it adeptly tailors models to a target domain, it also inadvertently raises the likelihood of undesired outputs: the cross-entropy objective rewards the chosen responses but contains no term that penalizes the rejected ones, which often share vocabulary and structure with the chosen responses. This observation motivates a mechanism that preserves the domain adaptation of SFT while effectively discriminating against unfavorable generation styles.
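To make this concrete, here is a minimal sketch (not the paper's code) of the standard SFT objective: cross-entropy is computed only over the chosen completion's tokens, so nothing in the loss pushes down the probability of a disfavored response.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Plain SFT cross-entropy over a (prompt + chosen response) sequence.

    logits: (seq_len, vocab_size) model outputs
    labels: (seq_len,) token ids; the first `prompt_len` positions are the prompt
    """
    labels = labels.clone()
    labels[:prompt_len] = -100  # mask the prompt so only the chosen response is supervised
    # Shift so each position predicts the next token. Note that no term here ever
    # references a rejected response, so its likelihood is never penalized.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```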

ORPO: A Monolithic Approach

ORPO is presented as a direct response to this challenge. It merges domain adaptation and preference alignment into a single training objective with two components: the conventional negative log-likelihood (SFT) loss for domain adaptation, and an odds ratio loss that penalizes disfavored outputs relative to favored ones. The combined objective teaches the model to prefer the desired response style without a separate alignment stage or a reference model.
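The following is a PyTorch-style sketch of this objective, assuming length-normalized (average per-token) log-probabilities for the chosen and rejected responses and a weighting hyperparameter (λ in the paper, `beta` here). It illustrates the idea rather than reproducing the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              sft_nll: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """ORPO objective = SFT negative log-likelihood + beta * odds-ratio penalty.

    chosen_logps / rejected_logps: length-normalized log P(y|x) per example
    sft_nll: standard cross-entropy loss on the chosen responses
    """
    # log odds(y|x) = log P(y|x) - log(1 - P(y|x)); log1p(-exp(.)) is the stable form
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # odds-ratio term: -log sigmoid(log odds ratio), averaged over the batch
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    return sft_nll + beta * odds_ratio_term

# Toy usage with made-up log-probabilities
loss = orpo_loss(torch.tensor([-0.9, -1.1]), torch.tensor([-1.4, -1.6]), torch.tensor(1.0))
```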

Experiments and Results

Evaluation Framework

ORPO's efficacy is evaluated across models of varying sizes and against widely used alignment methods such as RLHF and DPO. The benchmarks include AlpacaEval for instruction following and MT-Bench for multi-turn instruction following. The comparisons cover baselines built through SFT followed by a reinforcement learning (RLHF) or direct preference optimization (DPO) stage.

Empirical Outcomes

The ORPO algorithm demonstrates superior performance, particularly on the AlpacaEval_2.0 and MT-Bench benchmarks. Models fine-tuned using ORPO notably outperformed their counterparts aligned with traditional methods, achieving up to 12.20% on AlpacaEval_2.0 and 7.32 on MT-Bench. Furthermore, controlled experiments validate ORPO's advantage over SFT, RLHF, and DPO across various datasets and model sizes.

Discussion

Theoretical Justification

The choice of the odds ratio over the probability ratio in ORPO's loss function stems from its stability and its more moderate discrimination between favored and disfavored responses. This choice is pivotal when preference alignment is folded into the SFT phase: it avoids over-suppressing the disfavored responses, which would otherwise degrade the model's overall generative quality.
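A rough numeric illustration of this intuition (an illustrative example, not a figure from the paper): for the same pair of sequence probabilities, the log odds ratio is larger than the log probability ratio whenever the favored response is already more likely, so the -log σ(·) penalty saturates at a milder gap and exerts less pressure to drive the disfavored probability toward zero.

```python
import math

def log_prob_ratio(p_w: float, p_l: float) -> float:
    return math.log(p_w / p_l)

def log_odds_ratio(p_w: float, p_l: float) -> float:
    odds = lambda p: p / (1.0 - p)
    return math.log(odds(p_w) / odds(p_l))

def penalty(contrast: float) -> float:
    # -log sigmoid(contrast): the loss applied to either contrast statistic
    return math.log(1.0 + math.exp(-contrast))

for p_w, p_l in [(0.30, 0.20), (0.60, 0.30), (0.90, 0.50)]:
    pr, orr = log_prob_ratio(p_w, p_l), log_odds_ratio(p_w, p_l)
    print(f"p_w={p_w:.2f} p_l={p_l:.2f}  "
          f"log-PR={pr:.2f} -> penalty {penalty(pr):.2f}  "
          f"log-OR={orr:.2f} -> penalty {penalty(orr):.2f}")
```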

Computational Efficiency

Because ORPO dispenses with a reference model, it in principle halves the number of forward passes required per training batch relative to reference-based methods such as DPO. This design both simplifies the preference alignment pipeline and reduces the memory and compute overhead of multi-model or multi-phase approaches.
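As a back-of-the-envelope illustration (the helper `seq_logprob` below is hypothetical, standing in for one forward pass that scores a sequence under a model): a DPO-style step scores each chosen/rejected pair under both the policy and a frozen reference model, while an ORPO step scores the pair under the policy alone.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: one forward pass that returns log P_model(tokens).
# A real implementation would run the model and average token log-probs.
def seq_logprob(model: Callable[[Sequence[int]], float], tokens: Sequence[int]) -> float:
    return model(tokens)

def dpo_step_scorings(policy, reference, chosen, rejected):
    # DPO-style: 2 models x 2 sequences = 4 forward passes per example
    return (seq_logprob(policy, chosen), seq_logprob(policy, rejected),
            seq_logprob(reference, chosen), seq_logprob(reference, rejected))

def orpo_step_scorings(policy, chosen, rejected):
    # ORPO: policy only = 2 forward passes per example, half of the above
    return (seq_logprob(policy, chosen), seq_logprob(policy, rejected))
```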

Future Perspectives

The introduction of ORPO marks a notable step toward efficient and effective preference alignment for LLMs. Its simplicity, combined with its demonstrated efficacy, opens the door to further refinement of alignment methodology. Future work might explore ORPO's applicability to broader domains, multifaceted preference signals, and larger-scale models.

Authors (3)
  1. Jiwoo Hong (12 papers)
  2. Noah Lee (10 papers)
  3. James Thorne (48 papers)
Citations (112)