Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment (2405.17931v1)

Published 28 May 2024 in cs.CL and cs.LG

Abstract: Effectively aligning LLMs with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose integrating the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, introducing the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between SFT and pretrained models, effectively steering the gradient towards maximizing rewards in the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families, such as Qwen and LLaMA, across various model sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.

Overview of "Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment"

The paper addresses the challenge of mitigating the "alignment tax" incurred when training LLMs with Reinforcement Learning from Human Feedback (RLHF). Alignment tax refers to the degradation of fundamental model abilities that occurs when LLMs are aligned with human preferences via RLHF. The authors propose bringing offline model merging techniques into the RLHF optimization loop itself, yielding Online Merging Optimizers that balance reward maximization against alignment tax at every training step.
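
As a rough illustration of the offline-merging starting point, the sketch below shows parameter-wise linear interpolation between an SFT and an RLHF checkpoint, assuming PyTorch state dicts; the function name interpolate_checkpoints, the coefficient alpha, and the file paths are illustrative, not taken from the paper.

    import torch

    def interpolate_checkpoints(sft_state: dict, rlhf_state: dict, alpha: float = 0.5) -> dict:
        """Parameter-wise linear interpolation of two checkpoints.

        alpha = 0.0 recovers the SFT model and alpha = 1.0 the RLHF model;
        intermediate values trade alignment reward against alignment tax.
        """
        return {name: (1.0 - alpha) * sft_param + alpha * rlhf_state[name]
                for name, sft_param in sft_state.items()}

    # Hypothetical usage (paths are placeholders):
    # sft_state = torch.load("sft_model.pt")
    # rlhf_state = torch.load("rlhf_model.pt")
    # merged_state = interpolate_checkpoints(sft_state, rlhf_state, alpha=0.3)

Sweeping alpha traces out the reward-versus-tax trade-off that motivates moving the merging step online.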

The research presents a comprehensive examination and validation of online merging optimizers across various LLM architectures, RLHF algorithms, and model sizes. Experimental results indicate that these optimizers significantly enhance alignment performance and mitigate alignment tax, outperforming traditional regularization techniques and offline model merging methods.

Contributions

  1. Discovery of Parameter Interpolation: The authors initially discover that interpolating RLHF and Supervised Fine-Tuning (SFT) model parameters facilitates a trade-off between human preference alignment and foundational capabilities, effectively reducing alignment tax at some cost to alignment reward.
  2. Online Merging Optimizer Proposal: Building on the interpolation insight, the paper proposes the Online Merging Optimizer, which merges gradients with the delta parameters of the SFT model (relative to the pretrained model) at each RLHF optimization step. This steers the gradient toward maximizing rewards while staying aligned with the SFT model's optimization direction (a simplified sketch follows this list).
  3. Broad Experimental Validation: The optimizer is tested across several LLM families, including Qwen and LLaMA, with model sizes ranging from 1.8B to 8B parameters. It is compatible with different RLHF algorithms such as DPO and KTO and with existing model merging methods like DARE and TIES. Results demonstrate superior performance across 14 benchmarks.
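
To make the second contribution concrete, here is a minimal, heavily simplified sketch of one online-merging update, assuming PyTorch tensors; the names online_merge_step, sft_delta, and merge_weight are illustrative, and the paper's actual optimizer builds on merging methods such as DARE and TIES rather than the plain blend shown here.

    import torch

    def online_merge_step(param: torch.Tensor,
                          grad: torch.Tensor,
                          sft_delta: torch.Tensor,
                          lr: float = 1e-6,
                          merge_weight: float = 0.1) -> torch.Tensor:
        """One simplified online-merging update (illustrative sketch only).

        sft_delta holds the per-parameter difference theta_SFT - theta_pretrained.
        The raw RLHF gradient step is blended with this delta so the update keeps
        pointing roughly along the SFT optimization direction while still chasing reward.
        """
        rl_update = -lr * grad  # plain gradient-descent step on the RLHF objective
        merged_update = (1.0 - merge_weight) * rl_update + merge_weight * lr * sft_delta
        param.data.add_(merged_update)  # apply the merged update in place
        return param

Because the blend happens at every optimization step and for every parameter tensor, this online variant differs from one-shot offline merging applied after training has finished.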

Key Results

  • Benchmark Performance: The proposed Online Merging Optimizer achieves higher overall performance across 14 benchmarks compared to traditional optimization techniques.
  • MT-Bench and AlpacaEval 2.0: It consistently earns higher alignment rewards, as evidenced by MT-Bench and AlpacaEval 2.0 scores, indicating improved alignment with human preferences.
  • Effectiveness Across Models: The optimizer shows robustness and effectiveness with different LLM backbones and sizes, demonstrating its general applicability.

Theoretical and Practical Implications

The research presents significant theoretical implications by framing the alignment tax issue within the context of mode connectivity in neural networks. The integration of offline model merging techniques into active training processes opens up new avenues for optimizing RLHF training. Practically, the introduction of Online Merging Optimizers is a step towards producing more balanced and capable LLMs with minimized alignment tax. These optimizers can be readily adopted in various settings, providing more robust models without requiring extensive modifications to existing training pipelines.
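
For readers unfamiliar with the term, linear mode connectivity can be stated roughly as follows; this is a generic textbook-style formulation rather than notation taken from the paper.

    % Two solutions \theta_A and \theta_B of a loss L are linearly mode-connected
    % if the loss stays low along the straight path between them:
    \theta(\alpha) = (1 - \alpha)\,\theta_A + \alpha\,\theta_B, \qquad \alpha \in [0, 1],
    \qquad L\big(\theta(\alpha)\big) \le \max\{L(\theta_A), L(\theta_B)\} + \varepsilon .

In this framing, interpolating the SFT and RLHF checkpoints corresponds to moving along such a path between the two solutions.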

Speculations on Future Developments

Given the success of Online Merging Optimizers in mitigating alignment tax, future developments may focus on refining these techniques for even more granular control over the trade-off between reward maximization and tax minimization. Further research can explore:

  • Application of online merging optimizers in other areas prone to catastrophic forgetting, such as continual learning scenarios.
  • Hybrid models that combine the advantages of parameter-efficient training techniques like LoRA with online merging.
  • Enhanced memory efficiency methods to reduce the computational overhead associated with maintaining delta parameters.

Conclusion

This paper presents a well-structured approach to addressing the alignment tax problem in RLHF training of LLMs by introducing Online Merging Optimizers. The results are promising, demonstrating significant performance improvements across multiple benchmarks and alignment tasks. This research contributes valuable insights into optimizing human-aligned AI models and sets a foundation for further advancements in the field.

Limitations

The primary limitation discussed is the memory overhead of maintaining the reference model's delta parameters, which may hinder scalability in some scenarios. However, these drawbacks are outweighed by the significant gains in model performance and alignment capability that the proposed method offers. Future work could focus on alleviating this overhead through parameter-efficient techniques.
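
As a back-of-the-envelope illustration of this overhead (an assumption-based estimate, not a figure reported in the paper), keeping one extra bfloat16 copy of the delta parameters for an 8B-parameter model costs roughly 15 GiB of memory:

    # Rough estimate; assumes the delta parameters are stored in bfloat16.
    num_params = 8e9                 # e.g. an 8B-parameter model
    bytes_per_param = 2              # bfloat16 uses 2 bytes per value
    delta_gib = num_params * bytes_per_param / 1024**3
    print(f"Extra memory for one delta copy: {delta_gib:.1f} GiB")  # ~14.9 GiB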

Overall, this research marks a notable step forward in the pursuit of more effective and balanced RLHF training methods.

Authors (6)
  1. Keming Lu
  2. Bowen Yu
  3. Fei Huang
  4. Yang Fan
  5. Runji Lin
  6. Chang Zhou