SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF (2411.01798v1)

Published 4 Nov 2024 in cs.LG

Abstract: In LLM development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally relies on the Kullback-Leibler (KL) divergence between the current policy and a frozen initial policy as a reference, which is added as a penalty in policy optimization algorithms like Proximal Policy Optimization (PPO). While this constraint prevents models from deviating too far from the initial checkpoint, it limits exploration of the reward landscape, reducing the model's ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-based Alignment Learning for Stronger Adaptation), a novel approach designed to overcome these limitations by creating a more flexible and better located reference model through weight-space averaging of two independent supervised fine-tuned (SFT) models. This model soup allows for larger deviation in KL divergence and exploring a promising region of the solution space without sacrificing stability. By leveraging this more robust reference model, SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance. We validate the effectiveness of SALSA through extensive experiments on popular open models (Llama2-7B, Mistral-7B, and Gemma-2B) across various benchmarks (MT-Bench, Arena-Hard, UltraFeedback), where it consistently surpasses PPO by fostering deeper exploration and achieving superior alignment in LLMs.

Summary

  • The paper introduces a novel model soup method that overcomes traditional RLHF limitations through weight-space averaging of SFT models.
  • It demonstrates that SALSA outperforms PPO, achieving win rates over 54% across benchmarks like Arena-Hard and MT-Bench.
  • The study opens pathways for future research on optimized weight interpolation and expanded model soup configurations in RLHF.

Analysis of SALSA: A Novel Method for Robust RLHF in LLMs

The field of Reinforcement Learning from Human Feedback (RLHF) has demonstrated strong potential for aligning LLMs with human values and preferences. However, the traditional RLHF setup adds a Kullback-Leibler (KL) divergence penalty that keeps the policy close to a frozen reference model, which restricts exploration of higher-quality solutions. This limitation can lead to suboptimal model performance, which the authors address by introducing SALSA (Soup-based Alignment Learning for Stronger Adaptation).
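To make the role of this penalty concrete, the sketch below shows a common form of per-token KL-shaped reward used in RLHF with PPO; the function name, tensor shapes, and the beta value are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def kl_shaped_rewards(reward_score, policy_logprobs, ref_logprobs, beta=0.05):
    """Standard RLHF reward shaping: subtract a per-token KL penalty
    between the current policy and a frozen reference model.

    reward_score:    scalar reward-model score for the full response
    policy_logprobs: (T,) log-probs of the sampled tokens under the policy
    ref_logprobs:    (T,) log-probs of the same tokens under the reference
    beta:            KL coefficient (illustrative value)
    """
    # Per-token KL estimate: log pi_theta(a_t | s_t) - log pi_ref(a_t | s_t)
    kl = policy_logprobs - ref_logprobs
    rewards = -beta * kl            # penalize drift away from the reference
    rewards[-1] += reward_score     # add the reward-model score at the final token
    return rewards
```

Because the penalty grows as the policy drifts from the reference, a frozen SFT reference anchors optimization to a narrow region; SALSA instead anchors it to a model soup.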

Overview of SALSA Methodology

SALSA aims to transcend the constraints of traditional RLHF by employing a "model soup": a reference model obtained through weight-space averaging of two independently trained supervised fine-tuned (SFT) models. This yields a more flexible, robust reference that lets the policy explore a broader region of parameter space while maintaining stability. The model soup exploits the tendency of models fine-tuned from the same initialization to lie in a shared low-error basin, which makes weight interpolation effective and enhances robustness and out-of-distribution generalization.
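As an illustration, a uniform soup of two SFT checkpoints can be built by averaging their parameters element-wise. The sketch below assumes hypothetical PyTorch state-dict checkpoint paths and a 0.5 mixing coefficient; the paper's exact interpolation choice may differ.

```python
import torch

def soup_state_dict(ckpt_a, ckpt_b, alpha=0.5):
    """Weight-space average of two SFT checkpoints trained from the same
    initialization; the averaged weights serve as the reference model."""
    sd_a = torch.load(ckpt_a, map_location="cpu")
    sd_b = torch.load(ckpt_b, map_location="cpu")
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share an architecture"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

# Usage (hypothetical paths): load the soup into the frozen reference model
# ref_model.load_state_dict(soup_state_dict("sft_run1.pt", "sft_run2.pt"))
```

The KL penalty in PPO is then computed against this averaged reference rather than a single SFT checkpoint.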

Key Findings

The paper provides comprehensive experimental evidence for the efficacy of SALSA across several models, including Llama2-7B, Mistral-7B, and Gemma-2B. In particular, SALSA outperformed Proximal Policy Optimization (PPO) on benchmarks such as MT-Bench, Arena-Hard, and UltraFeedback. Notably, the model-soup reference enabled exploration of higher-reward regions of parameter space, leading to better-aligned and more robust models.

Numerical Results:

  • SALSA achieved win rates of 54.01% for Llama2-7B and 54.40% for Mistral-7B on the Arena-Hard dataset.
  • On the MT-Bench dataset, SALSA achieved an adjusted win rate of 57.19% for Gemma-2B over PPO.

The findings indicate that the reward in the neighborhood of the model soup is consistently higher than around the original SFT model, since the soup lies in a higher-reward region of the landscape, corroborating the benefits of using a model soup as the reference point in RLHF frameworks.

Implications and Future Directions

This research emphasizes the value of more dynamic reference choices within RLHF, which broaden the exploration of candidate solutions without sacrificing alignment robustness. SALSA's use of model soups is a substantial step toward improving the stability, reliability, and performance of RLHF pipelines, making it a strong alternative to a single frozen reference model.

The insight into reward dynamics around model soups could spur further innovation in model-averaging techniques. Future work could explore soups comprising more than two models, investigate optimized weight-interpolation strategies, or apply the SALSA idea to other alignment methods such as Direct Preference Optimization (DPO). Studying the effect of the KL-coefficient setting, which the paper shows is crucial for optimal performance, is another worthwhile avenue for exploration.

In summary, SALSA represents a promising advancement in RLHF methodology, providing a foundation for future work aimed at achieving more sophisticated alignment in AI models. This paper not only introduces an effective technique but also opens multiple pathways for further research in advancing AI alignment protocols.
