Direct Language Model Alignment from Online AI Feedback (2402.04792v2)

Published 7 Feb 2024 in cs.AI, cs.CL, and cs.HC

Abstract: Direct alignment from preferences (DAP) methods, such as DPO, have recently emerged as efficient alternatives to reinforcement learning from human feedback (RLHF), that do not require a separate reward model. However, the preference datasets used in DAP methods are usually collected ahead of training and never updated, thus the feedback is purely offline. Moreover, responses in these datasets are often sampled from a LLM distinct from the one being aligned, and since the model evolves over training, the alignment phase is inevitably off-policy. In this study, we posit that online feedback is key and improves DAP methods. Our method, online AI feedback (OAIF), uses an LLM as annotator: on each training iteration, we sample two responses from the current model and prompt the LLM annotator to choose which one is preferred, thus providing online feedback. Despite its simplicity, we demonstrate via human evaluation in several tasks that OAIF outperforms both offline DAP and RLHF methods. We further show that the feedback leveraged in OAIF is easily controllable, via instruction prompts to the LLM annotator.

Overview

The paper introduces "Online AI Feedback" (OAIF), an approach that improves the alignment of LLMs with human preferences by incorporating online feedback directly during training. The main innovation of OAIF is the use of an LLM as the annotator, providing preference labels on the fly. In contrast to existing Direct Alignment from Preferences (DAP) methods, which rely on offline and typically static preference datasets, OAIF generates preference labels dynamically from the current model's own samples, ensuring that the feedback is both online and on-policy. This mechanism is shown to mitigate the overfitting and reward over-optimization that can arise with purely offline methods.
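
To make the online loop concrete, the following sketch shows one possible OAIF training step in Python. The helpers `policy.generate`, `annotate_preference`, and `dap_loss` are hypothetical placeholders standing in for the components described above; this is an illustration of the procedure, not the authors' implementation.

```python
# A minimal sketch of one OAIF training step, assuming hypothetical helpers:
# `policy.generate`, `annotate_preference`, and `dap_loss` stand in for the
# components described above and are not the authors' implementation.

def oaif_training_step(policy, ref_policy, annotator, prompt, optimizer):
    # 1. Sample two candidate responses from the *current* policy (on-policy).
    response_a = policy.generate(prompt)
    response_b = policy.generate(prompt)

    # 2. Ask the LLM annotator which response it prefers (online AI feedback).
    a_is_preferred = annotate_preference(annotator, prompt, response_a, response_b)
    y_w, y_l = (response_a, response_b) if a_is_preferred else (response_b, response_a)

    # 3. Apply any DAP loss (DPO, IPO, or SLiC) to the freshly labelled pair.
    loss = dap_loss(policy, ref_policy, prompt, y_w, y_l)

    # 4. Standard gradient update on the policy parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```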

Experimental Approach

The authors examined the advantages of online feedback across three DAP methods: Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Sequence Likelihood Calibration with Human Feedback (SLiC). The effectiveness of OAIF was demonstrated through both human and automatic evaluations on standard LLM alignment tasks. Models aligned with OAIF significantly outperformed those trained with the offline variants of the same methods, indicating OAIF's robustness and efficiency. A sketch of the three pairwise losses follows.
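
For reference, the sketch below gives common formulations of the three pairwise DAP losses, written over per-example log-probability ratios rho = log pi_theta(y|x) - log pi_ref(y|x) for the preferred and dispreferred responses. The hyperparameter names and defaults (beta, tau, delta) are illustrative and may differ from the paper's exact notation and settings.

```python
import torch
import torch.nn.functional as F

# Pairwise DAP losses over log-probability ratios rho_w (preferred response)
# and rho_l (dispreferred response), each a tensor of per-example values.
# Hyperparameters are illustrative defaults, not the paper's exact settings.

def dpo_loss(rho_w, rho_l, beta=0.1):
    # DPO: negative log-sigmoid of the scaled log-ratio margin.
    return -F.logsigmoid(beta * (rho_w - rho_l)).mean()

def ipo_loss(rho_w, rho_l, tau=0.1):
    # IPO: squared deviation of the margin from the target 1 / (2 * tau).
    return ((rho_w - rho_l) - 1.0 / (2.0 * tau)).pow(2).mean()

def slic_loss(rho_w, rho_l, delta=1.0):
    # SLiC: hinge (margin) loss on the log-ratio difference.
    return torch.clamp(delta - (rho_w - rho_l), min=0.0).mean()
```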

Results and Implications

The paper reports strong numerical results: online DAP methods achieve an average win rate of approximately 66% over their offline variants. Additionally, in a 4-way comparison against RLHF and RLAIF, models trained with online DPO via OAIF were preferred over 58% of the time. A key practical advantage is the controllability of OAIF: for instance, the annotator's preference prompt can be adjusted to favor short responses, reducing average response length from roughly 120 to roughly 40 tokens without significantly sacrificing response quality.
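
As an illustration of this controllability, a hypothetical annotation prompt that biases preferences toward brevity might look like the following; the exact wording used in the paper's experiments is not reproduced here.

```python
# A hypothetical annotation prompt illustrating the controllability discussed
# above: the instruction is edited to prefer shorter responses. The exact
# wording used in the paper is not reproduced here.
LENGTH_CONTROLLED_ANNOTATION_PROMPT = """\
You are given a user prompt and two candidate responses.
Choose the response that is helpful and accurate, strongly preferring
the shorter of the two when both are acceptable.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Which response is better? Answer with a single letter, "A" or "B".
"""
```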

Concluding Thoughts

This research makes a substantial contribution to the field of LLM alignment by introducing a simpler and potentially more scalable solution compared to existing methods. OAIF opens avenues for future work where alignment is adjusted dynamically and might help reduce the need for extensive human input in the model training process. The key insight that AI-driven online feedback can effectively align LLMs with human values potentially accelerates efforts to create AI that operates harmoniously within human-centric frameworks.

Authors (12)
  1. Shangmin Guo (18 papers)
  2. Biao Zhang (76 papers)
  3. Tianlin Liu (24 papers)
  4. Tianqi Liu (49 papers)
  5. Misha Khalman (9 papers)
  6. Felipe Llinares (3 papers)
  7. Thomas Mesnard (18 papers)
  8. Yao Zhao (272 papers)
  9. Bilal Piot (40 papers)
  10. Johan Ferret (24 papers)
  11. Mathieu Blondel (43 papers)
  12. Alexandre Rame (8 papers)
Citations (80)