
Weak-to-Strong Extrapolation Expedites Alignment (2404.16792v2)

Published 25 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The open-source community is experiencing a surge in the release of LLMs that are trained to follow instructions and align with human preference. However, further training to improve them still requires expensive computational resources and data annotations. Is it possible to bypass additional training and cost-effectively acquire better-aligned models? Inspired by the literature on model interpolation, we propose a simple method called ExPO to boost LLMs' alignment with human preference. Utilizing a model that has undergone alignment training (e.g., via DPO or RLHF) and its initial SFT checkpoint, ExPO directly obtains a better-aligned model by extrapolating from the weights of the initial and the aligned models, which implicitly optimizes the alignment objective via first-order approximation. Through experiments with twelve open-source LLMs on HuggingFace, we demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models, as evaluated on the mainstream LLM benchmarks AlpacaEval 2.0 and MT-Bench. Moreover, ExPO exhibits remarkable scalability across various model sizes (from 1.8B to 70B) and capabilities. Through controlled experiments and further empirical analyses, we shed light on the essence of ExPO amplifying the reward signal learned during alignment training. Our work demonstrates the efficacy of model extrapolation in expediting the alignment of LLMs with human preference, suggesting a promising direction for future research.

ExPO: A Method for Enhancing LLMs’ Alignment with Human Preferences via Model Extrapolation

Introduction and Motivation

The development of LLMs such as GPT-4 relies on aligning them with human preferences through supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO). While these alignment stages are effective, they demand substantial compute and preference annotations. This paper introduces ExPO (model extrapolation), a technique that extrapolates from an existing weaker model (e.g., the SFT checkpoint) and a medium-aligned model (e.g., its DPO/RLHF counterpart) to obtain a model with stronger alignment to human preferences, without any additional training. The method draws on insights from the model interpolation literature and offers a practical way to bypass some of the most resource-intensive stages of alignment training.

Methodology

Assumptions and Theoretical Foundations

At its core, ExPO rests on the assumption that a medium-aligned model M (e.g., one obtained via DPO or RLHF) can be viewed as an interpolation between its weaker initial SFT checkpoint M_w and a hypothetical better-aligned model M_s that was never explicitly trained. Under this assumption, inverting the interpolation allows ExPO to reverse-engineer an approximation of M_s directly from the weights of M and M_w.
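A one-line derivation makes this inversion explicit. The interpolation weight β below is introduced here purely for illustration; solving for M_s yields the extrapolation form with coefficient α = (1 − β)/β:

θ_M = β·θ_{M_s} + (1 − β)·θ_{M_w},  with 0 < β < 1
⇒ θ_{M_s} = θ_M + ((1 − β)/β)·(θ_M − θ_{M_w}) = θ_M + α·(θ_M − θ_{M_w}),  where α = (1 − β)/β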

Practical Implementation

In practice, ExPO amounts to a single operation in weight space: the extrapolated model is obtained as θ_{M_s} = θ_M + α·(θ_M − θ_{M_w}), where the coefficient α (alpha) controls how far to move beyond M along the direction of the weight change from M_w to M. Because no gradients or training data are involved, α can be tuned cheaply, much like a decoding hyperparameter, rendering the procedure computationally economical and feasible without further training.
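A minimal Python sketch of this weight-space operation, assuming the SFT checkpoint and the aligned model share the same architecture and parameter names (the model identifiers and the helper name expo_extrapolate are placeholders, not from the paper):

```python
import torch
from transformers import AutoModelForCausalLM

def expo_extrapolate(sft_name: str, aligned_name: str, alpha: float):
    """Weight extrapolation: theta_new = theta_aligned + alpha * (theta_aligned - theta_sft)."""
    sft = AutoModelForCausalLM.from_pretrained(sft_name, torch_dtype=torch.float32)
    aligned = AutoModelForCausalLM.from_pretrained(aligned_name, torch_dtype=torch.float32)

    sft_state = sft.state_dict()
    new_state = {}
    for name, theta_aligned in aligned.state_dict().items():
        theta_sft = sft_state[name]
        if torch.is_floating_point(theta_aligned):
            # Push past the aligned weights in the direction of the SFT -> aligned change.
            new_state[name] = theta_aligned + alpha * (theta_aligned - theta_sft)
        else:
            # Leave integer buffers (if any) untouched.
            new_state[name] = theta_aligned

    aligned.load_state_dict(new_state)
    return aligned

# Hypothetical usage: model names are placeholders, and alpha is chosen on a development set.
# model = expo_extrapolate("org/model-sft", "org/model-dpo", alpha=0.3)
```

Because the only free parameter is α, a small grid search (for example, scoring held-out prompts with a reward model or a dev-set win rate) is typically enough to select it, which is what keeps the cost comparable to tuning a decoding hyperparameter.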

Experiments and Results

Overview of Experimental Setup

The controlled experiments train models on varying fractions of the preference data (10%, 20%, and 100%) and then apply ExPO to each resulting checkpoint to measure the gains. Evaluation relies primarily on the AlpacaEval 2.0 benchmark, comparing length-controlled win rates against a GPT-4 reference.

Key Findings

When ExPO is applied to models trained on the reduced datasets (10% and 20%), they not only match but occasionally surpass the performance of the model trained on the full dataset (100%). This indicates that ExPO can amplify the alignment signal learned by under-trained models enough to approximate, and sometimes exceed, fully trained ones. Results also show larger gains for larger models, underscoring ExPO's scalability.

Theoretical and Practical Implications

ExPO represents an economical and scalable way to improve LLMs' alignment with human preferences beyond what their original alignment training achieved. It suggests that model states typically set aside as suboptimal, such as the initial SFT checkpoint, can serve as useful building blocks for constructing stronger models. For current and future LLM deployments, ExPO offers a pragmatic route to continual, alignment-focused improvement.

Future Directions

While the current methodology provides a solid foundation, future work could explore adaptive, module-specific extrapolation coefficients, remove the reliance on an external reward model for choosing α, and develop a more rigorous theoretical account of why ExPO works. Exploring ExPO's applicability across diverse model architectures and multimodal LLMs could further broaden its utility.

Conclusion

ExPO provides a promising avenue for improving LLMs' alignment with human preferences through an efficient, straightforward computation in weight space. By delivering better-aligned models without additional data or training, the method meets the economic and practical demands of modern AI research and applications, warranting further exploration and development.

Authors (5)
  1. Chujie Zheng (35 papers)
  2. Ziqi Wang (92 papers)
  3. Heng Ji (266 papers)
  4. Minlie Huang (225 papers)
  5. Nanyun Peng (205 papers)
Citations (33)