Reference-free Monolithic Preference Optimization with Odds Ratio
Introduction
In the landscape of preference alignment for LLMs, Supervised Fine-Tuning (SFT) has historically played a central role. However, existing methods often rely on multi-phase pipelines that involve additional models and training stages, increasing complexity and resource demands. To address this, the paper introduces Odds Ratio Preference Optimization (ORPO), which streamlines preference alignment by embedding it directly within the SFT phase, eliminating the need for a separate alignment stage or a reference model.
Methodology
The Crucial Role of SFT
The paper begins by dissecting the function of SFT in existing alignment pipelines, showing that while it adeptly tailors models to a target domain, it also raises the likelihood of undesired outputs: the cross-entropy objective contains no term penalizing disfavored completions, so responses in rejected styles gain probability alongside the chosen ones (see the sketch below). This observation motivates a mechanism that preserves the domain adaptation of SFT while effectively penalizing unfavorable generation styles.
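To make this concrete, the following minimal sketch (PyTorch style, with illustrative shapes and names) shows the SFT objective as an ordinary cross-entropy over the chosen response; nothing in it references the rejected response, so there is no force pushing its probability down.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the standard SFT objective (illustrative, not the authors'
# code). `logits` is (batch, seq_len, vocab); `labels` is (batch, seq_len) with
# prompt/padding positions set to -100 and the next-token shift handled upstream.
def sft_nll(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch*seq_len, vocab)
        labels.reshape(-1),
        ignore_index=-100,                    # skip masked positions
    )
# No term involves the rejected response: tokens it shares with the chosen
# style are pushed up as well, which is the behavior ORPO sets out to curb.
```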
ORPO: A Monolithic Approach
ORPO is presented as a direct response to this challenge. It merges domain adaptation and preference alignment into a single training stage via a loss function with two components: a conventional negative log-likelihood term for domain adaptation and a novel odds ratio term that penalizes disfavored outputs relative to favored ones. Optimizing this combined objective teaches the model to favor the desired response style without requiring a separate alignment stage or a reference model.
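A minimal sketch of this combined objective follows, assuming length-averaged log-likelihoods of the favored and disfavored responses under the model being trained (`avg_logp_chosen`, `avg_logp_rejected`) and a weighting hyperparameter `lam` for the odds-ratio term; the names and the default weight are illustrative rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

# Sketch of the combined ORPO-style objective (illustrative names).
def orpo_loss(avg_logp_chosen: torch.Tensor,
              avg_logp_rejected: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    # Conventional SFT term: negative log-likelihood of the chosen response.
    nll = -avg_logp_chosen
    # log odds(y|x) = log p - log(1 - p); the length-averaged log-probability
    # is moderate in magnitude, so exponentiating it is numerically safe here.
    log_odds_chosen = avg_logp_chosen - torch.log1p(-torch.exp(avg_logp_chosen))
    log_odds_rejected = avg_logp_rejected - torch.log1p(-torch.exp(avg_logp_rejected))
    # Odds-ratio term: -log sigmoid(log odds(y_w|x) - log odds(y_l|x)),
    # which pushes the favored response's odds above the disfavored one's.
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (nll + lam * odds_ratio_term).mean()
```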
Experiments and Results
Evaluation Framework
The efficacy of ORPO is evaluated across models of varying sizes and against state-of-the-art algorithms such as RLHF and DPO. The benchmarks include AlpacaEval for instruction following and MT-Bench for multi-turn instruction-following ability. The comparisons cover baselines built through SFT followed by reinforcement learning (RLHF) or Direct Preference Optimization (DPO).
Empirical Outcomes
The ORPO algorithm demonstrates superior performance, particularly on the AlpacaEval 2.0 and MT-Bench benchmarks. Models fine-tuned with ORPO notably outperformed counterparts aligned with traditional methods, reaching up to a 12.20% win rate on AlpacaEval 2.0 and a 7.32 score on MT-Bench. Furthermore, controlled experiments validate ORPO's advantage over SFT, RLHF, and DPO across various datasets and model sizes.
Discussion
Theoretical Justification
The odds ratio is chosen over the probability ratio in ORPO's loss function for its stability and for the more moderate discrimination it applies between favored and disfavored responses; the two quantities are spelled out in the sketch below. This choice is pivotal when preference alignment is integrated within the SFT phase, where over-suppressing disfavored responses can compromise the model's overall generative quality.
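For concreteness, here is an illustrative sketch of the two quantities being compared (the probability values are hypothetical): the probability ratio divides the two response probabilities directly, while the odds ratio first maps each probability p to its odds p / (1 - p), so the odds ratio equals the probability ratio times (1 - p_l) / (1 - p_w). How each behaves inside the log-sigmoid loss during joint training with SFT is what drives the paper's preference for the odds ratio.

```python
def odds(p: float) -> float:
    # Odds of an event with probability p.
    return p / (1.0 - p)

def probability_ratio(p_chosen: float, p_rejected: float) -> float:
    # Ratio of (length-normalized) response probabilities.
    return p_chosen / p_rejected

def odds_ratio(p_chosen: float, p_rejected: float) -> float:
    # Ratio of the corresponding odds, as used in ORPO's loss term.
    return odds(p_chosen) / odds(p_rejected)

# Hypothetical length-normalized sequence probabilities.
print(probability_ratio(0.30, 0.20))  # 1.5
print(odds_ratio(0.30, 0.20))         # ~1.71
```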
Computational Efficiency
Because ORPO dispenses with a reference model, it in principle halves the number of forward passes required per training batch relative to reference-based methods (a rough sketch follows). This design both simplifies the preference alignment pipeline and substantially reduces the computational overhead associated with multi-model or multi-phase approaches.
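The sketch below illustrates the accounting, assuming a Hugging Face-style `model(input_ids).logits` call and a hypothetical `sequence_logprobs` helper that turns one forward pass into a length-averaged log-likelihood (prompt/padding masking omitted for brevity): ORPO needs only the two policy passes, while a reference-based pairwise objective adds two more through a frozen reference model.

```python
import torch

# Illustrative sketch only; names and shapes are assumptions, not the authors' code.
def sequence_logprobs(model, input_ids: torch.Tensor) -> torch.Tensor:
    logits = model(input_ids).logits                      # one forward pass
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # next-token log-probs
    token_lp = logprobs.gather(-1, input_ids[:, 1:, None]).squeeze(-1)
    return token_lp.mean(dim=-1)                          # length-averaged

def orpo_batch(policy, chosen_ids, rejected_ids):
    # ORPO: two forward passes, both through the policy being trained.
    return (sequence_logprobs(policy, chosen_ids),
            sequence_logprobs(policy, rejected_ids))

def reference_based_batch(policy, reference, chosen_ids, rejected_ids):
    # Reference-based pairwise objectives: the same two policy passes, plus two
    # extra passes through a frozen reference model that ORPO does not need.
    with torch.no_grad():
        ref_chosen = sequence_logprobs(reference, chosen_ids)
        ref_rejected = sequence_logprobs(reference, rejected_ids)
    return (sequence_logprobs(policy, chosen_ids),
            sequence_logprobs(policy, rejected_ids),
            ref_chosen, ref_rejected)
```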
Future Perspectives
The introduction of ORPO marks a significant stride toward efficient and effective preference alignment in LLMs. Its simplicity, combined with demonstrated efficacy, paves the way for further exploration and refinement of alignment methodologies. Future work might examine ORPO's adaptability across broader domains, multifaceted preferences, and larger-scale models, potentially setting new benchmarks in generative AI.