
ORPO: Monolithic Preference Optimization without Reference Model (2403.07691v2)

Published 12 Mar 2024 in cs.CL and cs.AI

Abstract: While recent preference alignment algorithms for LLMs have demonstrated promising results, supervised fine-tuning (SFT) remains imperative for achieving successful convergence. In this paper, we study the crucial role of SFT within the context of preference alignment, emphasizing that a minor penalty for the disfavored generation style is sufficient for preference-aligned SFT. Building on this foundation, we introduce a straightforward and innovative reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the necessity for an additional preference alignment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored styles during SFT across model sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO on UltraFeedback alone surpasses the performance of state-of-the-art LLMs with more than 7B and 13B parameters: achieving up to 12.20% on $\text{AlpacaEval}_{2.0}$ (Figure 1), 66.19% on IFEval (instruction-level loose, Table 6), and 7.32 in MT-Bench (Figure 12). We release code and model checkpoints for Mistral-ORPO-$\alpha$ (7B) and Mistral-ORPO-$\beta$ (7B).

Reference-free Monolithic Preference Optimization with Odds Ratio

Introduction

In the landscape of preference alignment for LLMs, the role of Supervised Fine-Tuning (SFT) has historically been paramount. However, existing methods often employ multi-phase processes which involve additional models and training stages, leading to increased complexity and resource demand. Addressing this gap, the paper introduces an innovative approach dubbed Odds Ratio Preference Optimization (ORPO). This method streamlines preference alignment by embedding it directly within the SFT phase, eliminating the need for a separate alignment stage or a reference model.

Methodology

The Crucial Role of SFT

The paper begins by dissecting the role of SFT in existing alignment pipelines, showing that while it adeptly tailors models to a target domain, it also inadvertently raises the likelihood of undesired outputs: the cross-entropy objective rewards the chosen responses but contains no term that penalizes the rejected ones, which often share vocabulary and structure with the chosen responses. This observation motivates a mechanism that preserves the domain adaptation of SFT while effectively discriminating against unfavorable generation styles.
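To make this concrete, here is a minimal sketch (not the paper's code) of the standard SFT objective: cross-entropy is computed only over the chosen completion's tokens, so nothing in the loss pushes down the probability of a disfavored response.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Plain SFT cross-entropy over a (prompt + chosen response) sequence.

    logits: (seq_len, vocab_size) model outputs
    labels: (seq_len,) token ids; the first `prompt_len` positions are the prompt
    """
    labels = labels.clone()
    labels[:prompt_len] = -100  # mask the prompt so only the chosen response is supervised
    # Shift so each position predicts the next token. Note that no term here ever
    # references a rejected response, so its likelihood is never penalized.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
```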

ORPO: A Monolithic Approach

ORPO is presented as a direct response to this challenge. It merges domain adaptation and preference alignment into a single training objective with two components: the conventional negative log-likelihood (SFT) loss for domain adaptation, and an odds ratio loss that penalizes disfavored outputs relative to favored ones. The combined objective teaches the model to prefer the desired response style without a separate alignment stage or a reference model.
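The following is a PyTorch-style sketch of this objective, assuming length-normalized (average per-token) log-probabilities for the chosen and rejected responses and a weighting hyperparameter (λ in the paper, `beta` here). It illustrates the idea rather than reproducing the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              sft_nll: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """ORPO objective = SFT negative log-likelihood + beta * odds-ratio penalty.

    chosen_logps / rejected_logps: length-normalized log P(y|x) per example
    sft_nll: standard cross-entropy loss on the chosen responses
    """
    # log odds(y|x) = log P(y|x) - log(1 - P(y|x)); log1p(-exp(.)) is the stable form
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # odds-ratio term: -log sigmoid(log odds ratio), averaged over the batch
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    return sft_nll + beta * odds_ratio_term

# Toy usage with made-up log-probabilities
loss = orpo_loss(torch.tensor([-0.9, -1.1]), torch.tensor([-1.4, -1.6]), torch.tensor(1.0))
```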

Experiments and Results

Evaluation Framework

ORPO's efficacy is evaluated across models of varying sizes and against widely used alignment methods such as RLHF and DPO. The benchmarks include AlpacaEval for instruction following and MT-Bench for multi-turn instruction following. The comparisons cover baselines built through SFT followed by a reinforcement learning (RLHF) or direct preference optimization (DPO) stage.

Empirical Outcomes

The ORPO algorithm demonstrates superior performance, particularly on the AlpacaEval_2.0 and MT-Bench benchmarks. Models fine-tuned using ORPO notably outperformed their counterparts aligned with traditional methods, achieving up to 12.20% on AlpacaEval_2.0 and 7.32 on MT-Bench. Furthermore, controlled experiments validate ORPO's advantage over SFT, RLHF, and DPO across various datasets and model sizes.

Discussion

Theoretical Justification

The choice of the odds ratio over the probability ratio in ORPO's loss function stems from its stability and its more moderate discrimination between favored and disfavored responses. This choice is pivotal when preference alignment is folded into the SFT phase: it avoids over-suppressing the disfavored responses, which would otherwise degrade the model's overall generative quality.
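A rough numeric illustration of this intuition (an illustrative example, not a figure from the paper): for the same pair of sequence probabilities, the log odds ratio is larger than the log probability ratio whenever the favored response is already more likely, so the -log σ(·) penalty saturates at a milder gap and exerts less pressure to drive the disfavored probability toward zero.

```python
import math

def log_prob_ratio(p_w: float, p_l: float) -> float:
    return math.log(p_w / p_l)

def log_odds_ratio(p_w: float, p_l: float) -> float:
    odds = lambda p: p / (1.0 - p)
    return math.log(odds(p_w) / odds(p_l))

def penalty(contrast: float) -> float:
    # -log sigmoid(contrast): the loss applied to either contrast statistic
    return math.log(1.0 + math.exp(-contrast))

for p_w, p_l in [(0.30, 0.20), (0.60, 0.30), (0.90, 0.50)]:
    pr, orr = log_prob_ratio(p_w, p_l), log_odds_ratio(p_w, p_l)
    print(f"p_w={p_w:.2f} p_l={p_l:.2f}  "
          f"log-PR={pr:.2f} -> penalty {penalty(pr):.2f}  "
          f"log-OR={orr:.2f} -> penalty {penalty(orr):.2f}")
```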

Computational Efficiency

Because ORPO dispenses with a reference model, it in principle halves the number of forward passes required per training batch relative to reference-based methods such as DPO. This design both simplifies the preference alignment pipeline and reduces the memory and compute overhead of multi-model or multi-phase approaches.
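As a back-of-the-envelope illustration (the helper `seq_logprob` below is hypothetical, standing in for one forward pass that scores a sequence under a model): a DPO-style step scores each chosen/rejected pair under both the policy and a frozen reference model, while an ORPO step scores the pair under the policy alone.

```python
from typing import Callable, Sequence

# Hypothetical stand-in: one forward pass that returns log P_model(tokens).
# A real implementation would run the model and average token log-probs.
def seq_logprob(model: Callable[[Sequence[int]], float], tokens: Sequence[int]) -> float:
    return model(tokens)

def dpo_step_scorings(policy, reference, chosen, rejected):
    # DPO-style: 2 models x 2 sequences = 4 forward passes per example
    return (seq_logprob(policy, chosen), seq_logprob(policy, rejected),
            seq_logprob(reference, chosen), seq_logprob(reference, rejected))

def orpo_step_scorings(policy, chosen, rejected):
    # ORPO: policy only = 2 forward passes per example, half of the above
    return (seq_logprob(policy, chosen), seq_logprob(policy, rejected))
```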

Future Perspectives

The introduction of ORPO marks a notable step toward efficient and effective preference alignment for LLMs. Its simplicity, combined with its demonstrated efficacy, opens the door to further refinement of alignment methodology. Future work might explore ORPO's applicability to broader domains, multifaceted preference signals, and larger-scale models.

Authors (3)
  1. Jiwoo Hong (12 papers)
  2. Noah Lee (10 papers)
  3. James Thorne (48 papers)
Citations (112)