- The paper introduces Rewards-in-Context, a method that conditions models on multiple rewards to efficiently align outputs with diverse human preferences.
- The paper employs a three-stage process—offline fine-tuning, online Pareto refinement, and dynamic inference—to manage conflicting objectives with minimal computational overhead.
- The paper validates the approach on both language and diffusion models, providing a robust theoretical framework and practical insights for customizable, human-aligned AI systems.
Multi-objective Alignment of Foundation Models with the Rewards-in-Context Method
Introduction
Recent advances in the development of large language models (LLMs) have spotlighted the alignment of these models with human values and preferences, a cornerstone for building AI systems that are both helpful and harmless. At the heart of these efforts is Reinforcement Learning from Human Feedback (RLHF), a paradigm that enables the fine-tuning of foundation models to better reflect varied human preferences. Despite its potential, the inherent heterogeneity, multidimensionality, and occasionally conflicting nature of human preferences present considerable challenges to this alignment process. This paper introduces Rewards-in-Context (RiC), a novel approach that conditions the response of a foundation model on multiple rewards placed in its prompt context and relies on supervised fine-tuning for alignment.
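To make the conditioning idea concrete, here is a minimal sketch of how reward scores might be written into the prompt for multi-reward conditional supervised fine-tuning. The tag format and reward names are illustrative assumptions, not the paper's exact template.

```python
# Minimal sketch of multi-reward conditional prompting in the spirit of RiC.
# The tag format and reward names below are assumptions for illustration only.

def build_ric_prompt(user_prompt: str, reward_scores: dict[str, float]) -> str:
    """Prepend per-objective reward scores to the prompt so the model learns,
    during supervised fine-tuning, to condition its response on desired reward levels."""
    score_tags = " ".join(
        f"<{name}> {score:.1f} </{name}>" for name, score in reward_scores.items()
    )
    return f"{score_tags}\n{user_prompt}"


# Example: condition on a helpfulness reward and a harmlessness reward.
prompt = build_ric_prompt(
    "Explain how vaccines work.",
    {"helpfulness_reward": 0.9, "harmlessness_reward": 1.0},
)
print(prompt)
```

During training, the scores come from reward models evaluated on the dataset responses; at inference, they are replaced with the reward levels the user wants.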
Background
The need for an efficient alignment process is underscored by the heterogeneous and often conflicting nature of human preferences. While existing approaches such as MORLHF and Rewarded Soups have made strides toward optimizing for multiple objectives with RLHF, they typically require substantial computational resources and cannot adjust dynamically to diverse human preferences at inference time. The paper positions RiC in this landscape, critiquing the limitations of linear scalarization methods and emphasizing the need for more nuanced approaches that account for the dynamic nature of human values.
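For concreteness, below is a minimal illustration of linear scalarization, the baseline strategy critiqued above: each fixed weight vector collapses the reward vector into a single scalar, which is why MORLHF-style pipelines typically train a separate policy per weighting. The reward names and values are made up.

```python
# Illustrative sketch of linear scalarization: a fixed preference weighting
# reduces a multi-dimensional reward to one scalar to be maximized by RLHF.
import numpy as np


def scalarize(rewards: np.ndarray, weights: np.ndarray) -> float:
    """Weighted-sum scalarization of a multi-dimensional reward vector."""
    assert np.isclose(weights.sum(), 1.0) and (weights >= 0).all()
    return float(rewards @ weights)


rewards = np.array([0.8, 0.3])  # e.g., [helpfulness, harmlessness] for one response
for w in ([0.9, 0.1], [0.5, 0.5], [0.1, 0.9]):
    # Each weighting defines a different scalar objective, hence a different policy.
    print(w, scalarize(rewards, np.array(w)))
```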
RiC Algorithm
RiC is a simple yet adaptable method that restructures the alignment problem into three key stages: an offline training stage based on multi-reward conditional supervised fine-tuning, an online training stage that refines the empirical Pareto front, and an inference stage that flexibly adapts to user preferences. Notably, RiC employs a dynamic inference-time adjustment method that steers generation toward Pareto-optimal trade-offs among the objectives, achieving strong alignment with far less computational overhead than traditional MORLHF approaches.
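The inference stage can be pictured as follows: user preference weights are mapped to per-objective target reward values, which are then written into the prompt context (as in the conditioning sketch above) before sampling. This is a hedged sketch assuming a simple linear interpolation between observed reward bounds; the paper's actual preference-to-reward mapping and normalization may differ.

```python
# Sketch of RiC-style inference-time conditioning. The linear interpolation used
# here is a simplified placeholder, not the paper's exact preference-to-reward map.
from typing import Sequence


def preference_to_rewards(
    weights: Sequence[float],
    reward_mins: Sequence[float],
    reward_maxs: Sequence[float],
) -> list[float]:
    """Map normalized preference weights to desired reward targets, one per objective."""
    assert abs(sum(weights) - 1.0) < 1e-6, "preference weights should sum to 1"
    return [
        lo + w * (hi - lo)  # a larger weight pushes the target toward that reward's max
        for w, lo, hi in zip(weights, reward_mins, reward_maxs)
    ]


# Example: a user who cares 70% about objective 1 and 30% about objective 2,
# with both rewards normalized to [-1, 1] during offline training.
targets = preference_to_rewards([0.7, 0.3], reward_mins=[-1.0, -1.0], reward_maxs=[1.0, 1.0])
print(targets)  # [0.4, -0.4]
```

Because only the conditioning values change, a single fine-tuned model can serve arbitrarily many preference weightings without retraining.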
Empirical Evaluation
RiC's efficacy is empirically validated on LLMs and diffusion models across a range of text- and image-generation tasks, demonstrating its ability to align efficiently with diverse rewards while substantially reducing the required computational resources. The experiments show that RiC achieves better alignment across a spectrum of preferences with only around 10% of the GPU hours required by conventional MORLHF baselines. These results underscore RiC's promise for building more nuanced, human-aligned AI systems at a fraction of the usual computational cost.
Theoretical Insights
Beyond empirical results, the paper presents a rigorous analytical framework for understanding the dynamics of the preference-to-reward mapping process underpinning the RiC algorithm. This framework unveils the mechanism by which RiC can flexibly align model outputs with a broad range of human preferences, advancing our theoretical understanding of efficient multi-objective alignment in foundation models.
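As a rough sketch of the setting such an analysis concerns, the standard KL-regularized multi-reward objective and the role of a preference-to-reward map can be written as follows. The notation here is assumed for illustration and is not taken verbatim from the paper.

```latex
% Hedged sketch of the multi-objective alignment setting (assumed notation).
% Given m reward models r_1, ..., r_m and a user preference vector on the simplex
%   w \in \Delta^{m-1}, with w_i \ge 0 and \sum_i w_i = 1,
% the standard KL-regularized multi-reward objective is:
\begin{equation}
  \max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}
    \Big[ \textstyle\sum_{i=1}^{m} w_i\, r_i(x, y) \Big]
  \;-\;
  \beta\, \mathbb{E}_{x \sim \mathcal{D}}
    \Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big].
\end{equation}
% A preference-to-reward map f : \Delta^{m-1} \to \mathbb{R}^m converts w into
% target reward values (\hat{r}_1, \dots, \hat{r}_m) = f(w), which are placed in
% the prompt context at inference time instead of retraining a policy for each w.
```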
Future Directions
Looking ahead, the paper speculates on the broader implications of RiC for the development of customizable AI systems, suggesting potential for future work in expanding the algorithm's capability for even more nuanced adjustments to user preferences. It raises important questions about the scalability of RiC, its application beyond text and image generation tasks, and the exploration of alternative preference-to-reward mapping strategies that could further enhance model alignment with human values.
In summary, Rewards-in-Context represents a significant step forward in the endeavor to align foundation models with human preferences. Its combination of simplicity, adaptability, and computational efficiency opens new avenues for developing AI systems that are both beneficial and aligned with diverse human values, highlighting the potential for further innovations in multi-objective model alignment.