
Variational Best-of-N Alignment (2407.06057v1)

Published 8 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Best-of-N (BoN) is a popular and effective algorithm for aligning LLMs to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the LLM, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the LLM to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the LLM to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on a controlled generation task suggest that while variational BoN is not as effective as BoN in aligning LLMs, it is close to BoN performance as vBoN appears more often on the Pareto frontier of reward and KL divergence compared to models trained with KL-constrained RL objective.

Variational Best-of-N Alignment: A New Approach to LLM Alignment

The paper "Variational Best-of-N Alignment" by Afra Amini, Tim Vieira, and Ryan Cotterell introduces a novel method for aligning LLMs with human preferences, built around the Best-of-N (BoN) algorithm. The authors identify the computational inefficiency of BoN at inference time and propose an alternative fine-tuning strategy, variational BoN (vBoN), which is analogous to mean-field variational inference and aims to retain BoN's alignment quality at a fraction of the inference cost.

Summary

The BoN algorithm is commonly used for aligning LLMs: N samples are drawn, and the one with the highest reward, as judged by a reward model, is selected as the output. Despite its effectiveness, BoN is computationally expensive, as it reduces sampling throughput by a factor of N. To mitigate this, the authors derive the distribution induced by the BoN algorithm and fine-tune the LLM to minimize the reverse (backward) KL divergence to this distribution, yielding the vBoN approach.
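
To make the inference-time procedure concrete, here is a minimal sketch of BoN sampling. The `generate` and `reward_model` callables are hypothetical placeholders for an LLM sampling interface and a reward-model scorer; they are not taken from the paper's code.

```python
# Minimal sketch of Best-of-N (BoN) sampling at inference time.
# `generate` and `reward_model` are hypothetical stand-ins for an LLM
# sampling call and a reward-model scoring call.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> str:
    """Draw n samples from the LLM and return the one with the highest reward."""
    candidates = [generate(prompt) for _ in range(n)]        # n-fold sampling cost
    rewards = [reward_model(prompt, y) for y in candidates]  # n reward-model calls
    best = max(range(n), key=rewards.__getitem__)
    return candidates[best]
```

The n-fold sampling and scoring cost in the two list comprehensions is exactly what vBoN tries to amortize into a one-time fine-tuning step.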

Methodology

The authors' strategy involves:

  1. Understanding the BoN Distribution:
    • The distribution induced by BoN is derived mathematically.
    • This involves computing the probability that a particular string is selected by BoN when N samples are drawn (see the sketch after this list).
  2. Variational Approximation:
    • The LLM is fine-tuned to approximate the BoN distribution by minimizing the reverse KL divergence.
    • This approach is analogous to mean-field variational inference.
  3. Optimization:
    • The vBoN objective has useful properties, notably insensitivity to monotonic transformations of the reward values.
    • Optimization is carried out with PPO (Proximal Policy Optimization), aiming to match BoN performance closely while significantly reducing inference cost.
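
To illustrate items 1 and 2, the sketch below estimates the reward CDF F(y) = P_{y' ~ pi_ref}(r(y') <= r(y)) by Monte Carlo over samples from the reference model and plugs it into the no-ties form of the BoN selection probability, pi_BoN(y) = F(y)^N - (F(y) - pi_ref(y))^N. The function names and interfaces are illustrative assumptions rather than the authors' implementation, and the no-ties formula is a standard derivation sketch, not a verbatim transcription of the paper's equations.

```python
import math
from typing import Callable, Sequence

def estimate_reward_cdf(y: str,
                        ref_samples: Sequence[str],
                        reward: Callable[[str], float]) -> float:
    """Monte Carlo estimate of F(y) = P_{y' ~ pi_ref}(r(y') <= r(y)),
    computed from samples already drawn from the reference model."""
    r_y = reward(y)
    return sum(reward(s) <= r_y for s in ref_samples) / len(ref_samples)

def bon_log_prob(y: str,
                 p_ref_y: float,
                 ref_samples: Sequence[str],
                 reward: Callable[[str], float],
                 n: int) -> float:
    """Log-probability that BoN with n samples returns y, under a no-ties
    assumption: log( F(y)^n - (F(y) - pi_ref(y))^n ).  This BoN-induced
    distribution is the target that the reverse-KL (vBoN) objective pulls
    the fine-tuned policy towards."""
    f_y = estimate_reward_cdf(y, ref_samples, reward)
    mass = f_y ** n - max(f_y - p_ref_y, 0.0) ** n
    return math.log(max(mass, 1e-30))  # clamp to avoid log(0) from MC noise
```

Because y enters this target only through the rank statistic F(y), any strictly increasing transformation of the rewards leaves F, and therefore the objective, unchanged, which is the invariance property mentioned under item 3.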

Experiments and Results

The experiments focus on generating movie reviews with positive sentiment, utilizing a binary sentiment classifier as the reward model. Several alignment methods were compared:

  • BoN: Applied at inference time to pick the highest-reward sample.
  • BoN-SFT: Supervised fine-tuning on samples produced by BoN.
  • PPO: Fine-tuning with the KL-constrained RL objective (the two objectives are contrasted below).
  • DPO: Direct preference optimization, which avoids training an explicit reward model.
  • BoNBoN: An IPO-like approach that approximates the BoN distribution.
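
For orientation, the two fine-tuning objectives being compared can be written side by side. The first is the standard KL-constrained RL objective optimized by PPO; the second is the reverse-KL formulation that defines vBoN. The notation is schematic (prompt x, response y, reward model r, reference model pi_ref, regularization strength beta) and is a summary of the setup described above rather than a transcription of the paper's equations.

```latex
% KL-constrained RL objective (PPO baseline): maximize expected reward
% while penalizing divergence from the reference model pi_ref.
\[
\max_{\theta}\;
  \mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}\bigl[\, r(x, y) \,\bigr]
  \;-\; \beta\, \mathrm{KL}\bigl( \pi_{\theta} \,\|\, \pi_{\mathrm{ref}} \bigr)
\]

% vBoN: fit the policy to the BoN-induced distribution by minimizing
% the reverse (backward) KL divergence.
\[
\min_{\theta}\;
  \mathrm{KL}\bigl( \pi_{\theta} \,\|\, \pi_{\mathrm{BoN}} \bigr)
\]
```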

The results highlight:

  1. Performance Comparison:
    • BoN is the most effective algorithm but incurs the highest computational cost.
    • vBoN comes close to BoN's alignment performance while avoiding the N-fold sampling cost, achieving a good trade-off between reward attainment and computational overhead.
    • DPO performs better in terms of win rates but lags in achieving high rewards.
    • PPO and BoN-SFT show higher KL divergences, indicating larger deviations from the reference model.
  2. Numerical Insights:
    • vBoN appears on the Pareto frontier of reward and KL divergence more often than models trained with the KL-constrained RL objective, indicating a favorable balance.
    • Monte Carlo estimates of the vBoN objective are shown to be reliable with a modest sample size (approximately 250 samples).

Implications and Future Directions

The practical implication of this research is improved computational efficiency in LLM alignment without substantially compromising performance, which is essential for the scalable deployment of LLMs. Theoretically, the variational formulation is robust to reward-scaling issues that often cause trouble for RL algorithms.

Looking ahead, this research opens pathways to optimizing other inference-time alignment algorithms through variational methods. Future work could extend the framework to multi-objective settings, optimizing models against several human preference metrics simultaneously. Additionally, integrating vBoN with more sophisticated reward models could further improve alignment quality and robustness across application domains.

Conclusion

"Variational \beston Alignment" contributes a significant advancement in aligning LLMs efficiently. By deriving and fine-tuning towards the \bon distribution through a variational approach, the authors present a method that strikes a balance between performance and computational feasibility. This work stands out for its mathematical rigor and practical significance, setting a new direction for research in LLM alignment.

Authors (3)
  1. Afra Amini (16 papers)
  2. Tim Vieira (29 papers)
  3. Ryan Cotterell (226 papers)