Soft Preference Optimization: Aligning Language Models to Expert Distributions (2405.00747v4)
Abstract: We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as LLMs, with human preferences without requiring a reward model. SPO optimizes model outputs directly on a preference dataset through a natural loss function that combines a preference loss with a regularization term applied over the model's entire output distribution, rather than restricting regularization to the preference dataset. Although SPO does not assume the existence of an underlying reward model, we demonstrate that, under the Bradley-Terry (BT) model assumption, it converges to a softmax of scaled rewards, with the distribution's "softness" adjustable via the softmax exponent, an algorithm parameter. We showcase SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.
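The abstract's description (a preference loss plus a regularization term over the model's entire output distribution, with a tunable softmax exponent) admits a simple sketch. Below is a minimal PyTorch illustration of one way such an objective could look; the function names, the HuggingFace-style `model(...).logits` interface, the sigmoid form of the preference term, and the sampled-response KL estimate are all assumptions made for illustration, not the paper's exact formulation.

```python
# A minimal sketch of an SPO-style objective (illustrative only; may differ from the paper).
# Assumptions: a HuggingFace-style causal LM whose forward pass returns `.logits`;
# a preference term -log sigmoid(alpha * (log pi(y_w) - log pi(y_l))), which equals
# -log [ pi(y_w)^alpha / (pi(y_w)^alpha + pi(y_l)^alpha) ]; and a KL-style regularizer
# to a reference model, estimated on responses sampled from the *current* model
# (the "entire output distribution", not just the preference pairs).
import torch
import torch.nn.functional as F


def sequence_logprob(model, input_ids, prompt_len):
    """Sum of response-token log-probabilities under `model` (prompt tokens masked out)."""
    logits = model(input_ids).logits[:, :-1, :]      # logits predicting tokens 1..T
    targets = input_ids[:, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # After the one-token shift, target position j corresponds to input token j+1,
    # so response tokens are the positions j >= prompt_len - 1.
    mask = (torch.arange(targets.size(1), device=targets.device) >= prompt_len - 1).float()
    return (token_logp * mask).sum(dim=-1)


def spo_style_loss(model, ref_model, chosen_ids, rejected_ids, sampled_ids,
                   prompt_len, alpha=1.0, beta=0.1):
    """Preference loss on (chosen, rejected) pairs + regularization on model samples."""
    logp_w = sequence_logprob(model, chosen_ids, prompt_len)
    logp_l = sequence_logprob(model, rejected_ids, prompt_len)

    # Preference term; `alpha` plays the role of the softmax exponent mentioned in the abstract.
    pref_loss = -F.logsigmoid(alpha * (logp_w - logp_l)).mean()

    # Regularization term on responses drawn from the current model (sampling and the
    # gradient through it are omitted for brevity); a plug-in KL estimate vs. ref_model.
    logp_model = sequence_logprob(model, sampled_ids, prompt_len)
    with torch.no_grad():
        logp_ref = sequence_logprob(ref_model, sampled_ids, prompt_len)
    reg = (logp_model - logp_ref).mean()

    return pref_loss + beta * reg
```

In this sketch, matching the model-implied preference probability to a Bradley-Terry preference probability would require log pi(y_w) - log pi(y_l) = (r(y_w) - r(y_l)) / alpha, i.e., pi proportional to exp(r / alpha): a softmax of scaled rewards whose "softness" is controlled by the exponent (here `alpha`), consistent with the abstract's claim.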
- Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30, 2017.
- Improving language understanding by generative pre-training. 2018.
- Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683, 2016.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Direct preference optimization with an offset. arXiv preprint, 2024.
- Provably robust DPO: Aligning language models with noisy feedback. arXiv preprint arXiv:2403.00409, 2024.
- Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. arXiv preprint arXiv:2401.08417, 2024.
- Relative preference optimization: Enhancing LLM alignment through contrasting responses across identical and diverse prompts. arXiv preprint arXiv:2402.10958, 2024.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
- Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944, 2023.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- R. Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005.
- Robin L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202, 1975.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
- Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations, 2023.
- TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759, 2023.
- Andrej Karpathy. tinyllamas. https://huggingface.co/karpathy/tinyllamas/commit/5c22d0a5f31b635f3e4bf8f2d4dd87363ae3a275, 2023. Hugging Face model repository.
- Andrej Karpathy. llama2.c: Inference Llama 2 in one file of pure C. https://github.com/karpathy/llama2.c, 2024. GitHub repository.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
- The wisdom of hindsight makes language models better instruction followers. arXiv preprint arXiv:2302.05206, 2023.
- Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Statistical rejection sampling improves preference optimization. In The Twelfth International Conference on Learning Representations, 2024.
Authors:
- Arsalan Sharifnassab
- Sina Ghiassian
- Saber Salehkaleybar
- Surya Kanoria
- Dale Schuurmans