SALMON: Self-Alignment with Principle-Following Reward Models
In the prevailing AI alignment paradigm, Supervised Fine-Tuning (SFT) is combined with Reinforcement Learning from Human Feedback (RLHF) to align LLMs with human intent. Despite its efficacy, this approach faces scalability and applicability limits because it depends on large volumes of high-quality human annotations. This document presents a new alignment framework, SALMON (Self-ALignMent with principle-following reward models), which aims to reduce the required human oversight while improving model performance.
The cornerstone of SALMON is its principle-following reward model, a notable departure from conventional pipelines that rely on a fixed reward model trained once on human preference data. Here, the reward model is trained on synthetic preference data generated from a concise set of human-defined principles, and it can score responses against arbitrary, adaptable guidelines supplied as instructions. Applying SALMON to the LLaMA-2-70b base model, the researchers built a capable AI assistant, Dromedary-2, which outperforms existing RLHF-trained systems across various benchmarks while using only six in-context exemplars and 31 principles as human supervision.
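To make the mechanism concrete, the sketch below illustrates the general idea of principle-grounded synthetic preference generation and principle-conditioned reward scoring. It is a minimal sketch only: the objects (`llm`, `reward_model`), method names (`generate`, `judge`, `score`), principles, and prompt formats are hypothetical placeholders, not the paper's actual implementation.

```python
# Minimal sketch of principle-following reward modeling (hypothetical API).
# Idea: (1) synthesize preference labels by asking the model which of two
# responses better follows a sampled human-written principle, then
# (2) score responses with a reward model conditioned on the principle text,
# so the same reward model can grade against arbitrary guidelines.

from dataclasses import dataclass
import random

# Example principles; the actual set is written by humans (31 in the paper).
PRINCIPLES = [
    "Be concise: avoid unnecessary repetition and filler.",
    "Be honest: state uncertainty instead of fabricating facts.",
    "Be ethical: refuse requests that could cause harm.",
]

@dataclass
class PreferenceExample:
    prompt: str
    principle: str
    chosen: str
    rejected: str

def build_synthetic_preferences(llm, prompts):
    """Create principle-grounded preference pairs without human labelers."""
    data = []
    for prompt in prompts:
        # Sample two candidate responses from the current policy.
        resp_a, resp_b = llm.generate(prompt), llm.generate(prompt)
        principle = random.choice(PRINCIPLES)
        # Ask the model itself which response better satisfies the principle.
        verdict = llm.judge(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response A: {resp_a}\nResponse B: {resp_b}\n"
            "Which response better follows the principle? Answer A or B."
        )
        chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
        data.append(PreferenceExample(prompt, principle, chosen, rejected))
    return data

def principle_conditioned_reward(reward_model, principle, prompt, response):
    """Score a response against an arbitrary, swappable principle."""
    return reward_model.score(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
    )
```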
SALMON builds on self-alignment and principle-guided learning, sidestepping the iterative collection of in-distribution preference data that RLHF typically requires. The implications are notable: by simply defining principles that discourage reward-hacking behavior, SALMON avoids continuous rounds of human data collection. This adaptability is most visible in the framework's ability to modify model behavior by updating the instructions given to the reward model, without gathering new preference data.
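As a hypothetical illustration of that adaptability, the snippet below shows how a reward-hacking tendency (say, padding answers with filler to inflate reward) could be countered by appending one principle to the reward model's instructions during RL scoring, rather than by collecting new data. The principle wording, the `rl_reward` helper, and the `reward_model.score` interface are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical illustration: steering RL-time rewards by editing principles
# instead of gathering new human preference data.

BASE_PRINCIPLES = [
    "Be helpful: directly address the user's request.",
    "Be honest: state uncertainty instead of fabricating facts.",
]

# Suppose the policy starts reward-hacking by padding answers with filler.
# Instead of re-labeling data, add a principle that penalizes that behavior.
ANTI_HACKING_PRINCIPLE = "Be concise: do not pad answers with irrelevant detail."

def rl_reward(reward_model, prompt, response, extra_principles=()):
    """Compute the RL training reward under the current principle set."""
    principles = list(BASE_PRINCIPLES) + list(extra_principles)
    instructions = "\n".join(f"- {p}" for p in principles)
    return reward_model.score(
        f"Instructions:\n{instructions}\nPrompt: {prompt}\nResponse: {response}"
    )

# Before the fix: reward computed under the base principles only.
#   r = rl_reward(rm, prompt, response)
# After the fix: a one-line change to the principle set redefines the reward.
#   r = rl_reward(rm, prompt, response, extra_principles=[ANTI_HACKING_PRINCIPLE])
```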
Tables in the paper compare the amount of human supervision and the benchmark performance of SALMON-trained models against state-of-the-art systems. Rivaling LLaMA-2-Chat while requiring far less human input, Dromedary-2 achieves competitive scores on established benchmarks such as MT-Bench, underscoring its proficiency and the promise of the SALMON approach for future AI systems.
The research's contribution is two-fold: practically, SALMON offers a more scalable alignment methodology; theoretically, it challenges the assumption that alignment must rest on extensive human supervision. The discussion also considers SALMON's potential for enabling scalable oversight, in which AI systems progressively self-align with limited human interaction, a pursuit that marks a pivotal shift in AI safety and reliability paradigms.
Overall, the work offers a compelling alternative trajectory for AI alignment, emphasizing operational efficiency and adaptability through principle-driven learning. By marrying principled alignment with self-improving dynamics, SALMON marks a step forward in the pursuit of autonomous, ethically aligned AI systems. Broader application domains, integration with multi-modal systems, and the long-term sustainability of AI behavior modification remain open questions for future inquiry.