Emergent Mind

Abstract

A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This only leverages the pairwise comparisons when the generations are placed in an identical context. However, such conditional rankings often fail to capture the complex and multidimensional aspects of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. While prior preference optimizations are designed for conditional ranking protocols (e.g., DPO), our proposed preference acquisition protocol introduces DOVE, a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, we find that the LLM trained with joint instruction-response preference data using DOVE outperforms the LLM trained with DPO by 5.2% and 3.3% win-rate for the summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code are available at https://github.com/Hritikbansal/dove.
[Figure: Comparison of LLM alignment methods' win-rates on summarization and helpfulness tasks, as judged by ChatGPT.]

Overview

  • Introduction of a novel alignment framework, Dove, for LLMs that extends beyond the traditional conditional preference optimization to joint preferences over instruction-response pairs.

  • Dove enables comparisons between different instruction-response pairs, offering a broader view of human preferences and enhancing the alignment of LLMs with human values.

  • In empirical evaluation, Dove outperforms DPO at aligning LLMs with human preferences, by 5.2% win-rate on summarization and 3.3% on open-ended dialogue.

  • The work encourages further research into preference acquisition methodologies and the integration of Dove with future model architectures to better align LLMs with human values.

Introduction

Aligning LLMs with human preferences is critical to applying them effectively across a range of tasks. Current alignment techniques, such as Direct Preference Optimization (DPO), rely primarily on conditional preference rankings: several responses are generated for a single instruction and then compared. This captures only a constrained view of human preferences, because the preference space is limited to comparisons between responses to identical instructions. This work introduces an alignment framework, Dove, that extends the paradigm to joint preferences over instruction-response pairs, giving a richer picture of the dimensions of human preference that conditional rankings alone do not capture.

Joint Preference Acquisition Protocol

This research revisits the traditional conditional preference acquisition paradigm and proposes acquiring preferences jointly over instruction-response pairs. Under this protocol, an annotator compares two instruction-response pairs whose instructions need not be identical, which exposes a broader range of the reasoning behind human preferences. Preference elicitation is therefore no longer restricted to responses that share the same context.
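The difference between the two protocols can be made concrete with a small data-structure sketch. The class and field names below (`Pair`, `JointPreference`, `is_conditional`) are hypothetical and chosen only for illustration; they are not taken from the paper or its repository. The point is that a conditional preference is the special case of a joint preference in which both pairs share the same instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """An instruction together with one response to it (hypothetical type)."""
    instruction: str
    response: str


@dataclass(frozen=True)
class JointPreference:
    """One annotation: `chosen` was preferred over `rejected`.

    The two pairs may have different instructions (joint protocol)
    or the same instruction (the conditional special case).
    """
    chosen: Pair
    rejected: Pair

    def is_conditional(self) -> bool:
        # True when both pairs share an instruction, i.e. the record
        # could also have been collected under conditional ranking.
        return self.chosen.instruction == self.rejected.instruction


# Conditional record: two responses to the same instruction.
conditional = JointPreference(
    chosen=Pair("Summarize the post.", "A short, faithful summary."),
    rejected=Pair("Summarize the post.", "An off-topic reply."),
)

# Joint record: the pairs come from two different instructions.
joint = JointPreference(
    chosen=Pair("Summarize post A.", "A short, faithful summary of A."),
    rejected=Pair("Summarize post B.", "A vague summary of B."),
)

print(conditional.is_conditional(), joint.is_conditional())
```

Representing both kinds of record with one type is a design choice made here for brevity; a real pipeline might store them separately.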

The Dove framework capitalizes on this by proposing an alignment objective that prioritizes the joint probability of chosen instruction-response pairs over the less preferred ones. Notably, this joint preference optimization bridges the gap between existing conditional preference optimization techniques and a more holistic preference acquisition methodology, capturing a diverse array of human evaluative dimensions.
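This summary does not state the objective's exact formula, so the following is only a minimal sketch of what such a loss might look like, assuming a DPO-style logistic form with a frozen reference model and a temperature `beta`. The function names, the default `beta=0.1`, and the choice of plain floats as inputs are all assumptions made for illustration. Each argument is the log-probability a model assigns to an instruction-response pair; how that quantity is computed is left open here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def joint_preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed temperature, not a value from the source
) -> float:
    """Assumed DPO-style loss over one chosen/rejected pair.

    The chosen and rejected pairs may come from different instructions;
    the loss only sees the log-probability of each pair under the
    policy and under the reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Minimizing this pushes the chosen pair's margin above the rejected one's.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# When the policy equals the reference, both margins are zero and the
# loss equals log(2).
baseline = joint_preference_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the policy's log-probability of the chosen pair lowers the loss.
improved = joint_preference_loss(-10.0, -15.0, -12.0, -15.0)

print(baseline, improved)
```

In this form, setting both pairs to share one instruction recovers the usual conditional comparison, which is one way to read the claim that the joint objective bridges existing conditional techniques and the broader protocol.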

Results and Implications

In the empirical evaluation, Dove outperforms traditional methods, including DPO, at aligning LLMs with human preferences. On the summarization and open-ended dialogue tasks, LLMs aligned with Dove achieve win rates 5.2% and 3.3% higher, respectively, than LLMs aligned with DPO. These results indicate that joint preferences yield a more comprehensive alignment of LLM outputs with human preferences.
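For readers unfamiliar with the metric, the sketch below shows one common way a win rate and a win-rate gap might be computed from per-example judge verdicts. It assumes each model output is compared against some reference output and labeled `"win"`, `"loss"`, or `"tie"`, and that a tie counts as half a win; both conventions are assumptions, and the verdict lists are made-up toy data, not results from the paper.

```python
from collections import Counter


def win_rate(judgments: list[str]) -> float:
    """Percentage of comparisons won, counting a tie as half a win (assumed)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    unknown = set(counts) - {"win", "loss", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / len(judgments)


# Toy verdicts for two hypothetical models on the same four prompts.
model_a = ["win", "win", "tie", "loss"]
model_b = ["win", "loss", "tie", "loss"]

# The gap is the difference between the two win rates.
gap = win_rate(model_a) - win_rate(model_b)
print(win_rate(model_a), win_rate(model_b), gap)
```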

Moreover, the work's exploration of joint preference optimization opens paths for preference elicitation that conventional alignment protocols, built on conditional preference rankings, leave unexplored. It encourages a reevaluation of preference acquisition paradigms so that LLMs better reflect diverse human values and intentions.

Future Directions

The introduction of Dove opens the way for further research into preference acquisition and model alignment. Future work could study how to select instruction-response pairs for joint preference acquisition, balancing the richness of the preference data against alignment efficacy. Integrating Dove with existing and upcoming model architectures, to strengthen LLMs' alignment with human values across a wider range of domains, is another promising direction.

In conclusion, the work identifies the limitations of existing preference acquisition protocols and presents a framework for learning from joint preferences over instruction-response pairs. Dove shows that a new optimization objective can improve LLM performance on summarization and open-ended dialogue, and it invites a rethinking of how preferences are acquired when aligning AI systems with human values.

