Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts (2402.10958v2)
Abstract: In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area: it trains on preference pairs derived from the same prompt and requires no additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To address this limitation, we propose Relative Preference Optimization (RPO). RPO is designed to distinguish between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs on a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. In empirical evaluations on dialogue and summarization tasks, as well as on the AlpacaEval 2.0 leaderboard, RPO demonstrates a superior ability to align LLMs with user preferences and improves their adaptability during training. Our code is available at https://github.com/yinyueqin/relative-preference-optimization
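To make the contrastive weighting idea concrete, below is a minimal sketch of an RPO-style loss, not the authors' exact implementation. It assumes that, within a batch, every chosen response is contrasted against every rejected response (including those from other prompts), and that each cross-prompt pair is weighted by the cosine similarity of the prompt embeddings from some sentence encoder, with a softmax temperature `tau`; the function names, shapes, and the temperature parameter are illustrative assumptions.

```python
# Hypothetical sketch of an RPO-style contrastive preference loss (PyTorch).
# policy_* / ref_* are summed token log-probabilities of each response under
# the trainable policy and the frozen reference model; all are shape (B,).
# prompt_embeddings is (B, D), e.g. from a sentence encoder.
import torch
import torch.nn.functional as F

def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             prompt_embeddings, beta=0.1, tau=0.5):
    # Implicit rewards, as in DPO: beta * log(pi / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)        # (B,)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)  # (B,)

    # Reward margin between chosen response i and rejected response j: (B, B)
    margins = chosen_rewards.unsqueeze(1) - rejected_rewards.unsqueeze(0)

    # Contrastive weights from prompt similarity; the temperature tau controls
    # how sharply the loss concentrates on semantically related prompts.
    emb = F.normalize(prompt_embeddings, dim=-1)
    sim = emb @ emb.T                       # cosine similarities, (B, B)
    weights = F.softmax(sim / tau, dim=-1)  # each row sums to 1

    # Weighted logistic loss over all chosen/rejected pairs in the batch.
    losses = -F.logsigmoid(margins)         # (B, B)
    return (weights * losses).sum(dim=-1).mean()
```

In this sketch, replacing the similarity-based weight matrix with the identity matrix recovers the standard paired DPO objective, while off-diagonal weights let unpaired or cross-prompt comparisons contribute to training.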
Authors: Yueqin Yin, Zhendong Wang, Yi Gu, Hai Huang, Weizhu Chen, Mingyuan Zhou