Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment (2410.02197v3)

Published 3 Oct 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce preference embedding, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0, following the LLM post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.

Citations (2)

Summary

  • The paper presents preference representation learning to efficiently capture complex, cyclic preferences beyond traditional models.
  • It achieves 100% accuracy in modeling intransitive preferences, outperforming classical Bradley-Terry methods.
  • Evaluations show up to 9.3% performance improvement on language model benchmarks, highlighting its practical impact on alignment.

General Preference Modeling with Preference Representations for Aligning LLMs

The paper "General Preference Modeling with Preference Representations for Aligning LLMs" presents a novel approach to modeling human preferences, which is critical for aligning foundation models with human values. Traditional methodologies, such as the Bradley-Terry (BT) model, often fall short in capturing complex preference structures due to their limitations in expressiveness and efficiency, particularly with intransitive preferences.

Overview of the Approach

The authors propose preference representation learning: responses are embedded into a latent space so that nuanced preference structures can be captured efficiently, addressing both expressiveness and computational complexity. On top of this, the paper introduces preference score-based General Preference Optimization (GPO), which extends reward-based reinforcement learning from human feedback so that cyclic and intransitive preferences, where traditional reward models struggle, can still guide policy training.
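
As a rough illustration of this scoring mechanism, the sketch below embeds two responses for the same prompt and combines their embeddings through a block-diagonal skew-symmetric operator to produce an antisymmetric preference score, whose sigmoid acts as the preference probability. This is a minimal reading of the idea, not the authors' implementation: the embedding networks, gating, and dimensionality used in the paper and its repository are omitted, and the function and variable names here are illustrative.

```python
# Minimal sketch of the preference-scoring idea (not the authors' code).
# Assumptions: responses are already embedded as even-dimensional vectors,
# and the skew-symmetric operator is block-diagonal with 2x2 blocks.
import torch


def skew_symmetric_operator(dim: int) -> torch.Tensor:
    """Block-diagonal operator built from 2x2 blocks [[0, -1], [1, 0]] (so R^T = -R)."""
    assert dim % 2 == 0, "embedding dimension must be even"
    block = torch.tensor([[0.0, -1.0], [1.0, 0.0]])
    return torch.block_diag(*([block] * (dim // 2)))


def preference_score(v_i: torch.Tensor, v_j: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Score for y_i over y_j; antisymmetric: swapping the arguments flips the sign."""
    return (R @ v_i) @ v_j


def preference_prob(v_i: torch.Tensor, v_j: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Model P(y_i preferred over y_j) as the sigmoid of the preference score."""
    return torch.sigmoid(preference_score(v_i, v_j, R))


# Toy usage with two 4-dimensional response embeddings for the same prompt.
R = skew_symmetric_operator(4)
v_a, v_b = torch.randn(4), torch.randn(4)
print(preference_prob(v_a, v_b, R).item(), preference_prob(v_b, v_a, R).item())
# The two probabilities sum to 1, and each response needs only one embedding,
# so scoring all pairs among N responses requires N embedding queries.
```

Because the operator satisfies R^T = -R, the score is zero when a response is compared with itself and flips sign when the arguments are swapped, so the two directed probabilities stay consistent without collapsing each response to a single scalar reward.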

Key Contributions

  1. Preference Representation Learning: The core innovation is embedding responses in a multi-dimensional latent space, which lets the model capture preference structures with linear query complexity. The embedding is paired with a skew-symmetric operator so that complex, cyclic preferences remain expressible.
  2. Improved Modeling of Cyclic Preferences: The proposed General Preference embedding Model (GPM) is designed to handle intransitive preferences. It outperforms the BT reward model, reaching 100% accuracy on cyclic preferences, where traditional models perform no better than random guessing (see the toy example after this list).
  3. Application to LLM Alignment: Integrating preference scores into GPO yields substantial improvements on downstream tasks. Evaluations on benchmarks such as AlpacaEval2.0 and MT-Bench showed performance gains of up to 9.3%, indicating the method's potential to align LLMs more closely with human preferences.
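
To make the intransitivity point concrete, the toy example below (an illustration, not taken from the paper's experiments) constructs a three-response cycle A ≻ B ≻ C ≻ A. A scalar Bradley-Terry reward would need r_A > r_B > r_C > r_A, which is impossible, whereas two-dimensional preference embeddings combined with a skew-symmetric operator realize the cycle exactly.

```python
# Toy illustration (not from the paper's code): a preference cycle A > B > C > A
# that no scalar Bradley-Terry reward can represent, realized exactly by 2-D
# embeddings placed on the unit circle with a skew-symmetric operator.
import math

import torch

R = torch.tensor([[0.0, -1.0], [1.0, 0.0]])  # 2-D skew-symmetric operator


def score(v_i: torch.Tensor, v_j: torch.Tensor) -> torch.Tensor:
    """Positive score means the first response is preferred over the second."""
    return (R @ v_i) @ v_j


# Embed the three responses at angles 0, 120, and 240 degrees.
angles = {"A": 0.0, "B": 2 * math.pi / 3, "C": 4 * math.pi / 3}
v = {k: torch.tensor([math.cos(a), math.sin(a)]) for k, a in angles.items()}

for winner, loser in [("A", "B"), ("B", "C"), ("C", "A")]:
    s = score(v[winner], v[loser]).item()
    print(f"score({winner} > {loser}) = {s:+.3f}")  # all three are positive (~+0.866)
# A Bradley-Terry model would need r_A > r_B > r_C > r_A, a contradiction,
# so its accuracy on these three pairs can be no better than chance.
```

The point of the example is only that a two-dimensional embedding space already suffices to represent a strict cycle; how such embeddings are learned from preference data at scale is what the paper studies.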

Experimental Validation

The authors validate their approach through a series of experiments. On the RewardBench benchmark, the GPM consistently outperformed the BT reward model, with margins of up to 5.6%. Experiments on cyclic preference datasets further showed that the model accurately predicts preferences in scenarios where scalar reward models cannot, while remaining computationally efficient.

Implications and Future Research

This research has significant theoretical and practical implications. Theoretical advancements include the ability to model and understand complex non-transitive preference patterns. Practically, this translates to more nuanced and effective human-model alignment in AI systems.

Future research directions could explore scaling the model to larger LLMs and more diverse datasets to enhance its applicability. Additionally, further exploration into the optimal dimensionality of latent spaces and embedding strategies could refine the balance between expressiveness and computational efficiency.

Conclusion

The paper makes a substantive contribution to the field of preference modeling by introducing a novel framework that effectively captures complex human preferences. By integrating these insights into LLM alignment processes, the research paves the way for developing AI systems that are better aligned with the subtleties of human values and judgments.