Soft Preference Optimization: Aligning Language Models to Expert Distributions (2405.00747v4)

Published 30 Apr 2024 in cs.LG and cs.AI

Abstract: We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as LLMs, with human preferences, without the need for a reward model. SPO optimizes model outputs directly over a preference dataset through a natural loss function that integrates preference loss with a regularization term across the model's entire output distribution rather than limiting it to the preference dataset. Although SPO does not require the assumption of an existing underlying reward model, we demonstrate that, under the Bradley-Terry (BT) model assumption, it converges to a softmax of scaled rewards, with the distribution's "softness" adjustable via the softmax exponent, an algorithm parameter. We showcase SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.
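
The abstract sketches the structure of the SPO objective: a pairwise preference loss whose sharpness is controlled by a softmax exponent, plus a regularization term taken over the model's entire output distribution rather than only the preference dataset. The following is a minimal, hypothetical PyTorch sketch of such an objective, not the authors' implementation: the function name spo_loss, the exponent alpha, the weight beta, and the Monte-Carlo KL estimate on policy samples are illustrative assumptions; the exact loss is defined in the paper.

```python
import torch.nn.functional as F


def spo_loss(logp_chosen, logp_rejected,
             logp_policy_samples, logp_ref_samples,
             alpha=1.0, beta=0.1):
    """Hypothetical sketch of a Soft Preference Optimization style loss.

    logp_chosen, logp_rejected: policy log-probabilities of the preferred and
        dispreferred responses from the preference dataset, shape (batch,).
    logp_policy_samples, logp_ref_samples: policy and reference-model
        log-probabilities of responses sampled from the policy, used for the
        regularizer over the model's own output distribution, shape (n,).
    alpha: softmax exponent controlling the "softness" of the aligned model.
    beta: weight of the regularization term (illustrative choice).
    """
    # Pairwise preference term: sigmoid(alpha * (log pi_w - log pi_l)) equals
    # pi_w^alpha / (pi_w^alpha + pi_l^alpha), i.e. a softmax over the pair with
    # exponent alpha applied to the model's sequence probabilities.
    preference_loss = -F.logsigmoid(alpha * (logp_chosen - logp_rejected)).mean()

    # Regularization beyond the preference dataset: a simple Monte-Carlo
    # estimate of KL(pi_theta || pi_ref) on responses sampled from the policy.
    kl_estimate = (logp_policy_samples - logp_ref_samples).mean()

    return preference_loss + beta * kl_estimate
```

In this sketch a larger alpha sharpens the pairwise term, matching the abstract's point that the softmax exponent tunes how "soft" the resulting aligned distribution is; under the Bradley-Terry preference assumption this is consistent with convergence toward a softmax of scaled rewards.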

Authors (5)
  1. Arsalan Sharifnassab
  2. Sina Ghiassian
  3. Saber Salehkaleybar
  4. Surya Kanoria
  5. Dale Schuurmans