Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment (2405.17931v1)

Published 28 May 2024 in cs.CL and cs.LG

Abstract: Effectively aligning LLMs with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose integrating the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, introducing the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between SFT and pretrained models, effectively steering the gradient towards maximizing rewards in the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families, such as Qwen and LLaMA, across various model sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.

Overview of "Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment"

The paper addresses the challenge of mitigating the "alignment tax" incurred when training LLMs with Reinforcement Learning from Human Feedback (RLHF). Alignment tax refers to the degradation of fundamental model abilities that occurs when LLMs are aligned with human preferences via RLHF. The authors propose bringing offline model merging techniques into the RLHF optimization loop itself, yielding Online Merging Optimizers that balance reward maximization against alignment tax at every training step.
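
As a rough illustration of the offline-merging starting point, the sketch below shows parameter-wise linear interpolation between an SFT and an RLHF checkpoint, assuming PyTorch state dicts; the function name interpolate_checkpoints, the coefficient alpha, and the file paths are illustrative, not taken from the paper.

    import torch

    def interpolate_checkpoints(sft_state: dict, rlhf_state: dict, alpha: float = 0.5) -> dict:
        """Parameter-wise linear interpolation of two checkpoints.

        alpha = 0.0 recovers the SFT model and alpha = 1.0 the RLHF model;
        intermediate values trade alignment reward against alignment tax.
        """
        return {name: (1.0 - alpha) * sft_param + alpha * rlhf_state[name]
                for name, sft_param in sft_state.items()}

    # Hypothetical usage (paths are placeholders):
    # sft_state = torch.load("sft_model.pt")
    # rlhf_state = torch.load("rlhf_model.pt")
    # merged_state = interpolate_checkpoints(sft_state, rlhf_state, alpha=0.3)

Sweeping alpha traces out the reward-versus-tax trade-off that motivates moving the merging step online.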

The research presents a comprehensive examination and validation of online merging optimizers across various LLM architectures, RLHF algorithms, and model sizes. Experimental results indicate that these optimizers significantly enhance alignment performance and mitigate alignment tax, outperforming traditional regularization techniques and offline model merging methods.

Contributions

  1. Discovery of Parameter Interpolation: The authors initially discover that interpolating RLHF and Supervised Fine-Tuning (SFT) model parameters facilitates a trade-off between human preference alignment and foundational capabilities, effectively reducing alignment tax at some cost to alignment reward.
  2. Online Merging Optimizer Proposal: Building on the interpolation insight, the paper proposes the Online Merging Optimizer, which merges gradients with the delta parameters of the SFT model (relative to the pretrained model) at each RLHF optimization step. This steers the gradient toward maximizing rewards while staying aligned with the SFT model's optimization direction (a simplified sketch follows this list).
  3. Broad Experimental Validation: The optimizer is tested across several LLM families, including Qwen and LLaMA, with model sizes ranging from 1.8B to 8B parameters. It is compatible with different RLHF algorithms such as DPO and KTO and with existing model merging methods like DARE and TIES. Results demonstrate superior performance across 14 benchmarks.
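
To make the second contribution concrete, here is a minimal, heavily simplified sketch of one online-merging update, assuming PyTorch tensors; the names online_merge_step, sft_delta, and merge_weight are illustrative, and the paper's actual optimizer builds on merging methods such as DARE and TIES rather than the plain blend shown here.

    import torch

    def online_merge_step(param: torch.Tensor,
                          grad: torch.Tensor,
                          sft_delta: torch.Tensor,
                          lr: float = 1e-6,
                          merge_weight: float = 0.1) -> torch.Tensor:
        """One simplified online-merging update (illustrative sketch only).

        sft_delta holds the per-parameter difference theta_SFT - theta_pretrained.
        The raw RLHF gradient step is blended with this delta so the update keeps
        pointing roughly along the SFT optimization direction while still chasing reward.
        """
        rl_update = -lr * grad  # plain gradient-descent step on the RLHF objective
        merged_update = (1.0 - merge_weight) * rl_update + merge_weight * lr * sft_delta
        param.data.add_(merged_update)  # apply the merged update in place
        return param

Because the blend happens at every optimization step and for every parameter tensor, this online variant differs from one-shot offline merging applied after training has finished.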

Key Results

  • Benchmark Performance: The proposed Online Merging Optimizer achieves higher overall performance across 14 benchmarks compared to traditional optimization techniques.
  • MT-Bench and AlpacaEval 2.0: It consistently earns higher alignment rewards, as evidenced by MT-Bench and AlpacaEval 2.0 scores, indicating improved alignment with human preferences.
  • Effectiveness Across Models: The optimizer shows robustness and effectiveness with different LLM backbones and sizes, demonstrating its general applicability.

Theoretical and Practical Implications

The research presents significant theoretical implications by framing the alignment tax issue within the context of mode connectivity in neural networks. The integration of offline model merging techniques into active training processes opens up new avenues for optimizing RLHF training. Practically, the introduction of Online Merging Optimizers is a step towards producing more balanced and capable LLMs with minimized alignment tax. These optimizers can be readily adopted in various settings, providing more robust models without requiring extensive modifications to existing training pipelines.
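
For readers unfamiliar with the term, linear mode connectivity can be stated roughly as follows; this is a generic textbook-style formulation rather than notation taken from the paper.

    % Two solutions \theta_A and \theta_B of a loss L are linearly mode-connected
    % if the loss stays low along the straight path between them:
    \theta(\alpha) = (1 - \alpha)\,\theta_A + \alpha\,\theta_B, \qquad \alpha \in [0, 1],
    \qquad L\big(\theta(\alpha)\big) \le \max\{L(\theta_A), L(\theta_B)\} + \varepsilon .

In this framing, interpolating the SFT and RLHF checkpoints corresponds to moving along such a path between the two solutions.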

Speculations on Future Developments

Given the success of Online Merging Optimizers in mitigating alignment tax, future developments may focus on refining these techniques for even more granular control over the trade-off between reward maximization and tax minimization. Further research can explore:

  • Application of online merging optimizers in other areas prone to catastrophic forgetting, such as continual learning scenarios.
  • Hybrid models that combine the advantages of parameter-efficient training techniques like LoRA with online merging.
  • Enhanced memory efficiency methods to reduce the computational overhead associated with maintaining delta parameters.

Conclusion

This paper presents a well-structured approach to addressing the alignment tax problem in RLHF training of LLMs by introducing Online Merging Optimizers. The results are promising, demonstrating significant performance improvements across multiple benchmarks and alignment tasks. This research contributes valuable insights into optimizing human-aligned AI models and sets a foundation for further advancements in the field.

Limitations

The primary limitation discussed is the memory overhead of maintaining the reference model's delta parameters, which may hinder scalability in some scenarios. However, these drawbacks are outweighed by the significant gains in model performance and alignment capability that the proposed method offers. Future work could focus on alleviating this overhead through parameter-efficient techniques.
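
As a back-of-the-envelope illustration of this overhead (an assumption-based estimate, not a figure reported in the paper), keeping one extra bfloat16 copy of the delta parameters for an 8B-parameter model costs roughly 15 GiB of memory:

    # Rough estimate; assumes the delta parameters are stored in bfloat16.
    num_params = 8e9                 # e.g. an 8B-parameter model
    bytes_per_param = 2              # bfloat16 uses 2 bytes per value
    delta_gib = num_params * bytes_per_param / 1024**3
    print(f"Extra memory for one delta copy: {delta_gib:.1f} GiB")  # ~14.9 GiB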

Overall, this research marks a notable step forward in the pursuit of more effective and balanced RLHF training methods.

Authors (6)
  1. Keming Lu
  2. Bowen Yu
  3. Fei Huang
  4. Yang Fan
  5. Runji Lin
  6. Chang Zhou