Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization (2404.00530v1)

Published 31 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: A common technique for aligning LLMs relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This only leverages the pairwise comparisons when the generations are placed in an identical context. However, such conditional rankings often fail to capture the complex and multidimensional aspects of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis that is based on eliciting preferences jointly over the instruction-response pairs. While prior preference optimizations are designed for conditional ranking protocols (e.g., DPO), our proposed preference acquisition protocol introduces DOVE, a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, we find that the LLM trained with joint instruction-response preference data using DOVE outperforms the LLM trained with DPO by 5.2% and 3.3% win-rate for the summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code are available at https://github.com/Hritikbansal/dove.

Authors (6)
  1. Hritik Bansal (38 papers)
  2. Ashima Suvarna (8 papers)
  3. Gantavya Bhatt (13 papers)
  4. Nanyun Peng (205 papers)
  5. Kai-Wei Chang (292 papers)
  6. Aditya Grover (82 papers)
Citations (7)

Summary

Aligning LLMs with Human Preferences through DOVE: A Framework for Joint Preference Optimization

Introduction

Aligning LLMs with human preferences is critical for their effective application across a range of tasks. Current alignment techniques, such as Direct Preference Optimization (DPO), rely on conditional preference rankings obtained by generating multiple responses to a single instruction and comparing them. This captures only a constrained view of human preferences, because the preference space is limited to comparisons between responses to an identical instruction. This work introduces DOVE, an alignment framework that extends the paradigm to joint preferences over instruction-response pairs, capturing dimensions of human preference that conditional rankings alone miss.

Joint Preference Acquisition Protocol

The paper revisits the traditional conditional preference acquisition paradigm and proposes eliciting preferences jointly over instruction-response pairs. Under this protocol, annotators compare pairs whose instructions need not be identical, so preferences are gathered over responses to distinct instructions rather than being confined to a single shared context; a hypothetical example of such a record is sketched below.
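To make the protocol concrete, the following snippet shows one way a joint-preference record might be represented. The field names are illustrative assumptions, not the paper's actual data schema; the essential property is that the chosen and rejected candidates come from different instructions.

```python
# Hypothetical joint-preference record (field names are illustrative, not the
# paper's schema): the chosen and rejected candidates are drawn from
# *different* instructions, unlike conditional (same-prompt) DPO data.
joint_preference_example = {
    "chosen": {
        "instruction": "Summarize the Reddit post below in two sentences: ...",
        "response": "The poster explains why they left their job and ...",
    },
    "rejected": {
        "instruction": "Give three tips for staying focused while studying.",
        "response": "Just try harder and avoid distractions.",
    },
}
```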

DOVE builds on this protocol with an alignment objective that upweights the joint probability of the chosen instruction-response pair over that of the rejected pair. This joint preference optimization bridges existing conditional preference optimization techniques and the broader acquisition methodology, tapping into a wider range of human evaluative dimensions; a minimal sketch of such an objective follows.
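The sketch below is a minimal illustration, assuming a DPO-style logistic loss applied to joint (instruction + response) log-likelihoods under the policy and a frozen reference model. The value of `beta`, the absence of padding masks, and the exact treatment of instruction tokens are assumptions made for illustration; the authors' released code is the authoritative implementation.

```python
import torch
import torch.nn.functional as F


def joint_logprob(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Sum of token log-probabilities over the full instruction+response
    sequence, i.e. log p(x, y) under the model. Unlike conditional (DPO-style)
    scoring, instruction tokens are kept in the sum rather than masked out.
    Padding masks are omitted here for brevity."""
    logps = F.log_softmax(logits[:, :-1, :], dim=-1)     # predict token t+1 from the prefix
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logps = logps.gather(-1, targets).squeeze(-1)  # [batch, seq_len - 1]
    return token_logps.sum(dim=-1)                       # [batch]


def dove_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Logistic loss that upweights the joint likelihood of the chosen
    (instruction, response) pair over the rejected pair, relative to a frozen
    reference model (hypothetical sketch mirroring the DPO formulation)."""
    chosen_margin = beta * (policy_chosen - ref_chosen)
    rejected_margin = beta * (policy_rejected - ref_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```

The only structural departure from conditional DPO in this sketch is that the sequence-level score includes the instruction tokens, which is what allows pairs with different instructions to be compared directly.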

Results and Implications

The empirical evaluation shows that DOVE outperforms DPO at aligning LLMs with human preferences. On summarization and open-ended dialogue tasks, models trained with DOVE achieve win rates 5.2% and 3.3% higher, respectively, than models trained with DPO. These findings underscore the effectiveness of joint preferences for aligning LLM outputs more comprehensively with human preferences.

Beyond these gains, joint preference optimization opens paths for preference elicitation that conventional alignment protocols based on conditional rankings leave unexplored, and it motivates a reevaluation of preference acquisition paradigms so that LLMs better reflect diverse human values and intentions.

Future Directions

The introduction of DOVE opens several directions for further research on preference acquisition and model alignment. Future work could study how to select instruction-response pairs for joint preference acquisition, balancing the richness of the preference data against alignment efficacy. Integrating DOVE with existing and upcoming model architectures to strengthen alignment with human values across a wider range of domains is another promising avenue.

In conclusion, by exposing the limitations of existing preference acquisition protocols and presenting a framework for leveraging joint preferences over instruction-response pairs, this work takes a meaningful step toward aligning LLMs with the multidimensional nature of human preferences. DOVE demonstrates improved performance across varied tasks through its optimization objective and invites a rethinking of how preference data is collected, opening new directions for aligning AI systems with human values.
