
DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling (2403.01197v2)

Published 2 Mar 2024 in cs.CL

Abstract: The performance of the reward model (RM) is a critical factor in improving the effectiveness of the LLM during alignment fine-tuning. Two challenges remain in RM training: 1) training the same RM on various categories of data may cause its generalization performance to suffer from multi-task disturbance, and 2) the human annotation consistency rate is generally only $60\%$ to $75\%$, so the training data contain considerable noise. To tackle these two challenges, we introduce the idea of Mixture-of-Experts (MoE) into the field of RM for the first time. We propose the Double-Layer MoE RM (DMoERM). The outer-layer MoE is a sparse model: after classifying an input into a task category, it routes the input to the corresponding inner-layer task-specific model. The inner-layer MoE is a dense model: we decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one. Their outputs are then synthesized by an MLP to compute the final reward. To minimize costs, we call a public LLM API to obtain the capability preference labels. Validation on manually labeled datasets confirms that our model attains superior consistency with human preferences and outperforms advanced generative approaches. Meanwhile, through BoN sampling and RL experiments, we demonstrate that our model outperforms state-of-the-art RM ensemble methods and mitigates the overoptimization problem. Our code and dataset are available at: https://github.com/quanshr/DMoERM-v1.
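As a rough illustration of the two-layer structure the abstract describes, here is a minimal PyTorch sketch. Everything in it is an assumption made for illustration: the class names (InnerMoERM, DMoERM), the linear "capability experts" standing in for LoRA-tuned LM experts, and the hard top-1 task router are hypothetical simplifications, not the authors' implementation (which lives in the linked repository).

```python
import torch
import torch.nn as nn


class InnerMoERM(nn.Module):
    """Dense inner-layer MoE for one task category (hypothetical sketch).

    Each expert scores one capability dimension of the response; an MLP
    synthesizes the per-dimension scores into a scalar reward.
    """

    def __init__(self, hidden_dim: int, num_capabilities: int):
        super().__init__()
        # Stand-ins for the LoRA-tuned experts: each maps the shared LM
        # representation to a score for one capability dimension.
        self.capability_experts = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_capabilities)
        )
        # MLP that aggregates capability scores into the final reward.
        self.aggregator = nn.Sequential(
            nn.Linear(num_capabilities, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) pooled representation of (prompt, response)
        scores = torch.cat([e(hidden) for e in self.capability_experts], dim=-1)
        return self.aggregator(scores).squeeze(-1)  # (batch,) scalar rewards


class DMoERM(nn.Module):
    """Double-layer MoE RM: sparse outer router over dense inner experts."""

    def __init__(self, hidden_dim: int, num_tasks: int, num_capabilities: int):
        super().__init__()
        # Outer-layer sparse router: classifies the input into a task
        # category and dispatches it to exactly one inner-layer model.
        self.task_router = nn.Linear(hidden_dim, num_tasks)
        self.inner_rms = nn.ModuleList(
            InnerMoERM(hidden_dim, num_capabilities) for _ in range(num_tasks)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        task = self.task_router(hidden).argmax(dim=-1)  # hard top-1 routing
        rewards = torch.empty(hidden.size(0), device=hidden.device)
        for t in task.unique():
            mask = task == t
            rewards[mask] = self.inner_rms[int(t)](hidden[mask])
        return rewards


# Toy usage with random features standing in for real LM embeddings.
if __name__ == "__main__":
    model = DMoERM(hidden_dim=768, num_tasks=4, num_capabilities=5)
    feats = torch.randn(8, 768)
    print(model(feats).shape)  # torch.Size([8])
```

In the paper's actual setup, the pooled hidden state would come from an LLM backbone with per-capability LoRA adapters, and the outer router is a trained task classifier; the sketch only shows how sparse outer routing composes with dense inner aggregation.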

