
BOND: Aligning LLMs with Best-of-N Distillation (2407.14622v1)

Published 19 Jul 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art LLMs. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.

Review: BOND: Aligning LLMs with Best-of-N Distillation

The paper "BOND: Aligning LLMs with Best-of-N Distillation" presents a novel approach to fine-tuning LLMs with reinforcement learning from human feedback (RLHF). The proposed method, Best-of-N Distillation (BOND), aims to emulate the Best-of-N sampling strategy without incurring its computational overhead at inference time. The fundamental motivation behind BOND is to distill the benefits of Best-of-N sampling, a technique that selects the best generation among N candidates, into a single efficient policy.
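
For context, the sketch below shows what Best-of-N sampling looks like at inference time under a learned reward model; the `generate` and `reward_model` callables are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of Best-of-N sampling; `generate` and `reward_model`
# are hypothetical stand-ins for a policy sampler and a learned reward model.
def best_of_n(prompt, generate, reward_model, n=16):
    """Sample n candidate generations and return the highest-reward one."""
    candidates = [generate(prompt) for _ in range(n)]        # draw N i.i.d. samples from the policy
    rewards = [reward_model(prompt, c) for c in candidates]  # score each candidate
    best = max(range(n), key=lambda i: rewards[i])           # index of the best-scoring candidate
    return candidates[best]
```

This is the strategy whose output distribution BOND tries to match with a single forward pass, avoiding the N-fold sampling and scoring cost at inference time.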

Key Contributions

This work is grounded in the context of improving the alignment of LLMs to human preferences through RLHF. The authors identify the challenges associated with conventional methods and propose a new algorithm that retains the advantages of Best-of-N sampling while addressing its drawbacks. The key contributions of this paper are as follows:

  1. Best-of-N Distillation (BOND): A novel RLHF algorithm that aims to match the distribution induced by Best-of-N sampling. This method significantly reduces computational expense at inference time while retaining the performance benefits of Best-of-N sampling.
  2. Jeffreys Divergence: The authors introduce the use of the Jeffreys divergence, a linear combination of forward and backward KL divergence, to better balance mode-seeking and mode-covering behaviors in the policy distribution.
  3. Iterative BOND: An iterative approach that amplifies the Best-of-N effect without requiring large values of N upfront. Each iteration distills the Best-of-N of a moving anchor policy, yielding a scalable and stable optimization process (a sketch of this loop follows the list).
  4. Empirical Validation: Demonstrated effectiveness through experiments in abstractive summarization and conversational models, showing improvements over traditional RLHF methods in both reward-KL trade-offs and benchmark performance.
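
As a rough illustration of the iterative scheme in contribution 3, the loop below distills the Best-of-N of a slowly moving anchor and then updates the anchor as an exponential moving average of the policy parameters. The `distill_step` callable, the PyTorch-style parameter update, and all hyperparameter values are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of iterative BOND with a moving anchor (illustrative only).
import copy

def iterative_bond(policy, distill_step, num_rounds=100, n=2, ema_rate=0.99):
    """Repeatedly distill the Best-of-N of a moving anchor into the policy."""
    anchor = copy.deepcopy(policy)  # start the anchor at the current policy
    for _ in range(num_rounds):
        # One BOND update: push the policy toward the anchor's Best-of-N distribution.
        policy = distill_step(policy, anchor, n)
        # Moving anchor: exponential moving average of the policy parameters
        # (assumes PyTorch-style modules exposing .parameters()).
        for p_anchor, p_policy in zip(anchor.parameters(), policy.parameters()):
            p_anchor.data.mul_(ema_rate).add_(p_policy.data, alpha=1.0 - ema_rate)
    return policy
```

Because each round distills the Best-of-N of an anchor that itself keeps improving, a small per-round N can compound into a much stronger effective amplification than a single large-N distillation.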

Numerical Results and Claims

The paper includes extensive experiments validating BOND across different LLM settings. Key numerical outcomes are summarized below:

  • Abstractive Summarization Experiments: On the XSum dataset, BOND consistently outperformed baselines in optimizing both backward and forward KL divergence. The Jeffreys divergence with β = 0.5 effectively minimized both KL components and optimized reward quantiles better than either forward or backward KL alone.
  • Practical Implications of Iterative BOND: The iterative BOND approach allowed scalable amplification of N without presetting large N values. Experiments showed continuous improvement in rewards with a stable and modest KL increase from the reference policy.
  • Comparison with RLHF Baselines: J-BOND, the practical algorithm derived from the BOND principles, outperformed a REINFORCE-based RLHF baseline in both reward gain and reward/KL trade-off.

Practical and Theoretical Implications

The practical implications of this research are substantial for the ongoing development and deployment of LLMs. The BOND algorithm, particularly in its practical iterative form (J-BOND), offers a more efficient route to high-quality, aligned LLMs: it removes the inference-time cost of sampling and scoring multiple candidates while retaining the benefits of Best-of-N sampling. The method's improved reward/KL trade-offs further indicate that it can produce models that are both high-performing and less prone to over-optimizing the reward signal.

From a theoretical standpoint, the research bridges the gap between policy gradient methods and distribution matching, introducing the Jeffreys divergence as a robust objective in the RLHF setting. This divergence effectively balances between exploring various modes of the data (mode-covering) and focusing on generating high-quality outputs (mode-seeking), a crucial advancement for generative models where the quality of the generated content is paramount.
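
For concreteness, one way to write a weighted Jeffreys divergence between the policy π_θ and the Best-of-N distribution π_BoN is given below; the weighting convention and notation are assumptions for illustration, not a transcription of the paper's exact definition.

```latex
% Illustrative weighted Jeffreys divergence (weighting convention assumed).
% Minimizing the forward KL over \pi_\theta is mode-covering; the backward (reverse) KL is mode-seeking.
J_\beta\!\left(\pi_\theta \,\|\, \pi_{\mathrm{BoN}}\right)
  = (1-\beta)\,\underbrace{\mathrm{KL}\!\left(\pi_{\mathrm{BoN}} \,\|\, \pi_\theta\right)}_{\text{forward KL (mode-covering)}}
  \;+\; \beta\,\underbrace{\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{BoN}}\right)}_{\text{backward KL (mode-seeking)}},
  \qquad \beta \in [0, 1].
```

With β = 0.5 the two terms are weighted equally, matching the symmetric setting highlighted in the summarization experiments above.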

Future Directions

This work opens several avenues for future research:

  • Enhanced Quantile Estimations: Exploring more sophisticated quantile estimation techniques, including learned quantile models and other hybrid approaches, could further improve the distribution matching process.
  • Diverse Use Cases: Implementing and testing BOND across a broader range of tasks beyond abstractive summarization and conversational agents, such as in complex interactive environments or domain-specific LLMs.
  • Hyperparameter Optimization: Systematic exploration of the trade-offs associated with different values of N and other hyperparameters in various RLHF contexts could lead to fine-tuning insights and best practices.
  • Real-world Applications: Deploying BOND-guided models in real-world settings to evaluate their effectiveness in practical applications, especially where inference efficiency is critical.

In summary, the BOND framework presents a promising step forward in aligning LLMs efficiently and effectively. Its innovative use of divergence measures and iterative amplification strategies provides a robust foundation for future advancements in RLHF and AI alignment. The empirical results support its potential and encourage further exploration and adoption in the field.

Authors (20)
  1. Pier Giuseppe Sessa (26 papers)
  2. Robert Dadashi (25 papers)
  3. Léonard Hussenot (25 papers)
  4. Johan Ferret (24 papers)
  5. Nino Vieillard (22 papers)
  6. Alexandre Ramé (23 papers)
  7. Bobak Shahriari (1 paper)
  8. Sarah Perrin (17 papers)
  9. Abe Friesen (5 papers)
  10. Geoffrey Cideron (12 papers)
  11. Sertan Girgin (24 papers)
  12. Piotr Stanczyk (12 papers)
  13. Andrea Michi (6 papers)
  14. Danila Sinopalnikov (7 papers)
  15. Sabela Ramos (10 papers)
  16. Amélie Héliou (10 papers)
  17. Aliaksei Severyn (29 papers)
  18. Matt Hoffman (14 papers)
  19. Nikola Momchev (12 papers)
  20. Olivier Bachem (52 papers)
Citations (13)