Aligning to Thousands of Preferences via System Message Generalization
The paper “Aligning to Thousands of Preferences via System Message Generalization” by Seongyun Lee et al. addresses a significant issue in aligning large language models (LLMs) with user preferences. Traditionally, alignment has involved gathering high-level preference data such as helpfulness and harmlessness, training reward models (RMs) on this data, and subsequently training the LLMs against them. However, this approach often fails to capture the diversity and nuance of individual human preferences.
Problem Statement
The core issue the paper highlights is the impracticality of scaling personalized reinforcement learning from human feedback (RLHF) to the diverse preferences of individual users. The traditional paradigm is not only resource-intensive but also limited by its reliance on generalized, often ambiguous user preferences.
Methodology
To address this issue, the authors introduce a novel paradigm that leverages system messages to guide LLM behavior. Users state their preferences within these system messages, allowing the LLM to generate responses that better align with individual intentions. However, LLMs are typically trained on a single, uniform system message, which limits their ability to generalize to diverse, unseen system messages.
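To make the paradigm concrete, the sketch below shows how a user might encode multifaceted preferences in a system message and generate a preference-conditioned response with the Hugging Face Transformers library. The model identifier, the chat template's support for a system role, and the preference wording are illustrative assumptions rather than details taken from the paper.

```python
# A minimal sketch of preference-conditioned generation via a system message.
# The model id and preference wording are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kaist-ai/janus-7b"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# The user's multifaceted preferences are stated directly in the system message.
messages = [
    {"role": "system", "content": (
        "You are an assistant for a high-school biology teacher. "
        "Prefer concise, step-by-step explanations, use everyday analogies, "
        "avoid jargon, and keep a warm, encouraging tone."
    )},
    {"role": "user", "content": "Explain how mRNA vaccines work."},
]

# Assumes the tokenizer ships a chat template that accepts a system role.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Changing only the system message, not the model weights, is what lets a single trained model serve many different preference profiles.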
For this purpose, the authors created the Multifaceted Collection, a comprehensive dataset encompassing 192k combinations of user preference values spanning 65k user instructions. The dataset is designed to go beyond generic helpfulness and harmlessness and cover a wide array of nuanced preferences. Using this dataset, the authors trained a 7B-parameter LLM called Janus.
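As a rough illustration of how such a dataset could be packed into training examples, the sketch below maps one record to the chat-style format commonly used for supervised fine-tuning. The dataset path and field names are assumptions about the schema, not the authors' actual column names.

```python
# Sketch: turn one Multifaceted Collection record into a chat-formatted SFT
# example. The field names ("system", "instruction", "response") and the
# dataset path are assumptions for illustration.
from datasets import load_dataset

dataset = load_dataset("kaist-ai/Multifaceted-Collection", split="train")

def to_chat_example(record):
    """Pack a preference-bearing system message, instruction, and gold response
    into the messages format most SFT trainers (e.g. TRL's SFTTrainer) accept."""
    return {
        "messages": [
            {"role": "system", "content": record["system"]},
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }

sft_dataset = dataset.map(to_chat_example, remove_columns=dataset.column_names)
print(sft_dataset[0]["messages"][0]["content"][:200])
```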
Experimental Setup
Janus was evaluated across 921 prompts drawn from five distinct benchmarks: AlpacaEval 2.0, FLASK, Koala, MT-Bench, and Self-Instruct. Each prompt was augmented with diverse, unseen system messages reflecting specific user preferences. The model's performance was benchmarked against several state-of-the-art LLMs, including Mistral 7B Instruct v0.2, GPT-3.5 Turbo, and GPT-4.
Janus achieved tie+win rates of 75.2%, 72.4%, and 66.4% against these models, respectively. Notably, Janus also outperformed LLaMA 3 8B Instruct on three benchmarks focused on response helpfulness, by margins of +4.0%, +0.1%, and +3.0%.
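For readers unfamiliar with the metric, the following sketch shows one plausible way a tie+win rate could be computed from pairwise LLM-as-a-judge verdicts; the verdict labels and data layout are my assumptions, not the paper's exact evaluation harness.

```python
# Sketch: compute a tie+win rate from pairwise judge verdicts.
from collections import Counter

def tie_win_rate(verdicts):
    """verdicts: list of 'win' / 'tie' / 'loss' outcomes for the test model
    against one baseline, one entry per evaluation prompt."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts["win"] + counts["tie"]) / total if total else 0.0

# Hypothetical example: 921 judged prompts against one baseline model.
example = ["win"] * 500 + ["tie"] * 190 + ["loss"] * 231
print(f"tie+win rate: {tie_win_rate(example):.1%}")  # ~74.9%
```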
Results and Analysis
The results underscore the efficacy of training LLMs on a diverse range of system messages. Janus generated personalized responses aligned with user-specific preferences and showed superior performance in pairwise comparisons, balancing generalization and specialization to deliver high-quality responses across varied contexts. In direct assessment, Janus achieved an average score of 4.24 out of 5.0, outperforming several contemporary models and closely trailing the largest and most advanced models such as GPT-4-Turbo-0125.
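The sketch below illustrates one way such a direct assessment could be run, with a judge model rating each response on a 1-to-5 scale against the preferences stated in the system message. The judge model, prompt wording, and score parsing are assumptions; the paper uses its own rubric and evaluation setup.

```python
# Sketch: direct assessment with an LLM judge. The judge model name, prompt
# wording, and parsing are assumptions, not the paper's exact rubric.
import re
from openai import OpenAI

client = OpenAI()

def judge_response(system_message, instruction, response, judge_model="gpt-4"):
    prompt = (
        "Rate from 1 to 5 how well the response satisfies the preferences in "
        f"the system message.\n\nSystem message:\n{system_message}\n\n"
        f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
        "Reply with a single integer."
    )
    out = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[1-5]", out.choices[0].message.content)
    return int(match.group()) if match else None

# Averaging per-prompt scores gives a benchmark-level direct assessment score:
# scores = [judge_response(*example) for example in eval_set]
# print(sum(scores) / len(scores))
```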
Furthermore, the robustness of Janus was validated through low toxicity scores on the RealToxicityPrompts benchmark, showing that the enhanced personalization did not come at the cost of increased harmfulness.
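As a rough illustration of this kind of safety check, the sketch below scores model continuations with the open-source Detoxify classifier; using Detoxify here is my assumption for illustration, not necessarily the scoring tool used in the paper.

```python
# Sketch: screen model continuations for toxicity in the spirit of the
# RealToxicityPrompts check. Detoxify is a stand-in scorer (an assumption),
# not necessarily the authors' exact tool.
from detoxify import Detoxify

scorer = Detoxify("original")  # pretrained toxicity classifier

continuations = [
    "Sure, here is a gentle explanation suitable for children...",
    "Here is a neutral summary of the requested article...",
]

scores = [scorer.predict(text)["toxicity"] for text in continuations]
print(f"mean toxicity: {sum(scores) / len(scores):.4f}")  # lower is better
```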
Implications and Future Directions
The implications of this research are substantial for both practical applications and theoretical advancements in AI. On a practical level, the ability to align LLMs with individualized preferences without retraining for each user represents a significant leap in creating more responsive, user-tailored AI systems. This has potential applications across domains such as personal assistants, educational tools, and customer service platforms.
Theoretically, the paper introduces a scalable solution to the problem of preference diversity in LLM training. By demonstrating that training with diverse system messages can not only improve alignment with specific user preferences but also enhance general alignment capabilities, the authors set the stage for more sophisticated, adaptive AI models.
Future work could delve into refining the system message generation process, exploring the boundaries of preference diversity that LLMs can handle, and applying this approach to even larger models and more complex user scenarios. Additionally, integrating this method with existing RLHF techniques and exploring the long-term impacts on model behavior across various tasks and user interactions could provide further insights into the robustness and scalability of this paradigm.
In conclusion, this paper contributes a substantive advancement in aligning LLMs with human preferences, paving the way for more personalized and effective AI systems.