Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble (2401.16635v3)

Published 30 Jan 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning LLMs with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. As a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. As using an ensemble of LLM-based reward models can be computationally and resource-expensive, we explore efficient ensemble methods including linear-layer ensemble and LoRA-based ensemble. Empirically, we run Best-of-$n$ and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.

This paper (Zhang et al., 30 Jan 2024) addresses a critical issue in Reinforcement Learning from Human Feedback (RLHF): the inaccuracy of the reward model trained on limited human preference data. This inaccuracy can lead to "reward hacking" or "reward overoptimization," where the LLM generates outputs that receive high predicted rewards from the faulty model but are actually misaligned with true human preferences. To mitigate this, the paper proposes using an ensemble of reward models to obtain more accurate and robust reward predictions.

A key challenge in ensembling LLM-based reward models is the high computational and resource cost associated with training and deploying multiple large models. The paper focuses on developing efficient ensemble methods to overcome this limitation.

The paper proposes three architectural designs for reward model ensembles:

  1. Ensemble of Single Reward Models: This is a straightforward approach where $k$ reward models are trained independently from different random initializations. Each model consists of a Transformer backbone and a linear layer. At inference, all $k$ models are loaded, and their predictions are ensembled. This provides strong diversity among models but is computationally expensive, requiring training and loading $k \times (\text{Transformer Parameters} + \text{Linear Layer Parameters})$.
    • Implementation Note: This method is resource-intensive during training and inference due to loading multiple large models. The paper notes it was not feasible for their PPO experiments.
  2. Linear-layer Ensemble: To improve efficiency, this method uses a single shared Transformer model and trains $k$ separate linear layers, each outputting a reward prediction. All $k$ linear layers and the shared Transformer are trained concurrently. At inference, the single Transformer is loaded, and its final hidden state is fed into the $k$ linear layers. This significantly reduces the number of parameters to train and load, requiring $\text{Transformer Parameters} + k \times \text{Linear Layer Parameters}$.
    • Implementation Note: This approach is more memory-efficient at inference, as it only loads one Transformer. Training is also potentially faster than training $k$ full models.
  3. LoRA-based Ensemble: This method aims to strike a balance between efficiency and model diversity. It also uses a shared Transformer backbone and trains $k$ separate linear layers. Additionally, it adds and trains $k$ separate LoRA adapters [hu2021lora] on the Transformer layers, one for each ensemble member. The LoRA adapters allow each ensemble member to introduce slight, low-rank modifications to the shared Transformer's weights.
    • Implementation Note: LoRA adapters have significantly fewer parameters than the full Transformer weights. This method requires $\text{Transformer Parameters} + k \times \text{Linear Layer Parameters} + k \times \text{LoRA Adapter Parameters}$.
    • Training Strategy: The paper describes a specific training procedure for the LoRA-based ensemble. The shared Transformer and the $k$ linear layers are first finetuned on a subset of the preference data, as in the linear-layer ensemble. The remaining data is then used to train only the LoRA adapters and the linear layers of each ensemble member, keeping the main Transformer weights frozen. This pre-finetuning step is found to be necessary to make the Transformer suitable for reward prediction. (A code sketch of the shared-backbone designs follows this list.)
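
As a concrete illustration of the shared-backbone designs, here is a minimal PyTorch-style sketch of the linear-layer ensemble. This is not the authors' released code: the class name, argument names, and the choice of the last non-padding token's hidden state as the sequence representation are illustrative assumptions. The LoRA-based variant would additionally attach one low-rank adapter per ensemble member to the shared backbone (e.g., via a LoRA library) and activate the matching adapter when computing that member's reward.

```python
# Minimal sketch (assumed, not the paper's code) of a linear-layer ensemble
# reward model: one shared Transformer backbone and k scalar reward heads.
import torch
import torch.nn as nn

class LinearLayerEnsembleRM(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int, k: int = 3):
        super().__init__()
        # Shared Transformer (e.g., a Hugging Face-style model initialized from the SFT checkpoint).
        self.backbone = backbone
        # k independently initialized scalar reward heads.
        self.heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(k)])

    def forward(self, input_ids, attention_mask):
        # Assumes the backbone returns hidden_states when asked (Hugging Face convention).
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = out.hidden_states[-1]                          # (batch, seq_len, hidden)
        last_idx = attention_mask.sum(dim=1).long() - 1         # last non-padding position
        feats = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        # Each head produces one reward; concatenate them as the k ensemble predictions.
        return torch.cat([head(feats) for head in self.heads], dim=-1)  # (batch, k)
```

The ensemble of single reward models would instead instantiate $k$ such modules, each with its own backbone, which is what makes that variant expensive to train and serve.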

For ensembling the predictions from the $k$ reward models, the paper explores two methods:

  1. Mean Value Prediction: Calculate the average of the $k$ predicted reward values. This is expected to reduce prediction variance.
  2. Lower Confidence Bound (LCB): Calculate $\text{mean}(R) - \beta \cdot \text{std}(R)$, where $R$ is the set of $k$ predictions and $\beta$ is a hyperparameter. This provides a more conservative estimate of the reward, which can be beneficial in offline settings or for mitigating overoptimization. (Both rules are sketched in code below.)
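
A minimal sketch of the two aggregation rules, assuming `rewards` is the (batch, $k$) tensor produced by an ensemble such as the one sketched above (the function name and signature are illustrative):

```python
import torch

def aggregate_rewards(rewards: torch.Tensor, method: str = "mean", beta: float = 1.0) -> torch.Tensor:
    """Combine k ensemble predictions of shape (batch, k) into one reward per example."""
    if method == "mean":
        return rewards.mean(dim=-1)                                # mean value prediction
    if method == "lcb":
        return rewards.mean(dim=-1) - beta * rewards.std(dim=-1)   # mean(R) - beta * std(R)
    raise ValueError(f"unknown aggregation method: {method}")
```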

The paper evaluates these ensemble methods within the standard RLHF framework using two common approaches:

  • Best-of-$n$: Generate $n$ responses for an instruction and select the one with the highest predicted reward from the ensemble (a selection sketch follows this list).
  • Proximal Policy Optimization (PPO): Use the ensembled reward model as the reward function to finetune the SFT model using the PPO algorithm.
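
Below is a sketch of Best-of-$n$ selection with the ensembled reward; `policy_generate`, `tokenizer`, and `reward_ensemble` are placeholders (assumptions, not the paper's interfaces) for the SFT model's sampler, its tokenizer, and an ensemble reward model like the one sketched earlier. For PPO, the same aggregated scalar simply replaces the single-model reward inside the PPO training loop.

```python
import torch

@torch.no_grad()
def best_of_n(prompt: str, policy_generate, tokenizer, reward_ensemble, n: int = 16) -> str:
    # Sample n candidate responses from the SFT policy (decoding details omitted).
    candidates = [policy_generate(prompt) for _ in range(n)]
    scores = []
    for response in candidates:
        enc = tokenizer(prompt + response, return_tensors="pt")
        member_rewards = reward_ensemble(enc["input_ids"], enc["attention_mask"])  # (1, k)
        scores.append(aggregate_rewards(member_rewards, method="mean").item())
    # Return the candidate with the highest ensembled reward.
    return candidates[max(range(n), key=lambda i: scores[i])]
```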

Experiments are conducted using the AlpacaFarm and MT-Bench datasets and evaluation setups. The base LLM is Llama-7b finetuned via SFT (SFT10k). The reward models are also initialized from SFT10k. The ensemble size $k$ is set to 3.

Key experimental findings, addressing the paper's three research questions (Q1–Q3):

  • Using reward model ensembles consistently improves alignment performance compared to using a single reward model, as measured by win rates on AlpacaEval and scores on MT-Bench. This confirms that ensembling helps mitigate the negative effects of single reward model inaccuracies (answering Q1).
  • Among the efficient methods, LoRA-based ensemble generally achieves the best performance, closely matching or sometimes exceeding the performance of the full ensemble of single models (answering Q2). Linear-layer ensemble also improves performance but is slightly behind LoRA-based and full ensembles.
  • Both Mean Value Prediction and Lower Confidence Bound (LCB) prediction methods yield similar performance in their experiments (answering Q3). This suggests that for their setup, simply averaging the predictions is sufficient to gain the benefits of ensembling.

Practical Implementation Considerations:

  • Computational Trade-offs:
    • Ensemble of Single Reward Models: Highest training time (k independent runs) and highest inference memory (load k full models). Best for scenarios with abundant resources or where maximum performance is critical regardless of cost.
    • Linear-layer Ensemble: Reduced training parameters (shared transformer) and lowest inference memory (load 1 transformer + k small layers). Most efficient in terms of memory, good for deployment on limited hardware.
    • LoRA-based Ensemble: Moderate training parameters (shared transformer + k adapters) and moderate inference memory (load 1 transformer + k adapters + k small layers). Offers a good balance of performance and efficiency. (A rough parameter-count comparison is sketched after this list.)
  • Dataset Management: For LoRA-based ensemble, splitting the training data for pre-finetuning the shared backbone and then training the adapters and linear layers is a practical detail crucial for its performance.
  • Hyperparameters: The paper provides detailed hyperparameters used for reward modeling (with/without LoRA), PPO, and decoding, which are essential for reproducing their results and implementing the methods.
  • Integration into RLHF pipelines: The ensembled reward prediction function replaces the single-reward-model call within the standard Best-of-$n$ selection or PPO optimization loop. For PPO, the efficiency of the LoRA-based and linear-layer methods makes using an ensemble feasible, unlike the full ensemble of single models.
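
To make the trade-offs above concrete, here is a back-of-the-envelope parameter count under assumed settings (a 7B-parameter backbone with hidden size 4096, $k = 3$, scalar reward heads, and rank-8 LoRA adapters on the query and value projections of 32 layers). These figures are illustrative assumptions, not numbers reported in the paper.

```python
# Rough parameter accounting for the three ensemble designs (illustrative
# assumptions: 7B backbone, hidden size 4096, k = 3, rank-8 LoRA on the
# q/v projections of 32 layers).
backbone = 7e9                            # shared Transformer parameters
hidden, k, rank, layers = 4096, 3, 8, 32
head = hidden + 1                         # one scalar reward head (weights + bias)
lora = layers * 2 * (2 * hidden * rank)   # per adapter: 2 projections x (A and B matrices)

print(f"ensemble of single RMs: {k * (backbone + head):.2e} parameters loaded")
print(f"linear-layer ensemble:  {backbone + k * head:.2e}")
print(f"LoRA-based ensemble:    {backbone + k * (head + lora):.2e}")
```

Under these assumptions, the linear-layer and LoRA-based ensembles add only thousands to a few million parameters on top of a single backbone, while the full ensemble roughly triples the memory footprint.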

In summary, the paper demonstrates that using efficient reward model ensemble techniques, particularly the LoRA-based approach, is a practical and effective way to improve the alignment of LLMs in RLHF by producing more robust reward signals, without incurring the prohibitive costs of ensembling multiple full models.

References (29)
  1. Concrete problems in AI safety. arXiv:1606.06565.
  2. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv:2307.15217.
  3. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307.
  4. Reward Model Ensembles Help Mitigate Overoptimization. arXiv:2310.02743.
  5. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.
  6. Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking. arXiv:2312.09244.
  7. Scaling Laws for Reward Model Overoptimization.
  8. Adam Gleave and Geoffrey Irving. 2022. Uncertainty Estimation for Language Reward Models. arXiv:2203.07472.
  9. REALM: Retrieval-Augmented Language Model Pre-Training. arXiv:2002.08909.
  10. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685.
  11. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12519–12530.
  12. W. Bradley Knox and Peter Stone. 2009. Interactively Shaping Agents via Human Reinforcement: The TAMER Framework. In Proceedings of the Fifth International Conference on Knowledge Capture, K-CAP ’09, pages 9–16, New York, NY, USA.
  13. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
  14. Conservative Q-Learning for Offline Reinforcement Learning. In Neural Information Processing Systems.
  15. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv:1612.01474.
  16. AI safety gridworlds. arXiv:1711.09883.
  17. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643.
  18. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
  19. Reward Uncertainty for Exploration in Preference-based Reinforcement Learning. arXiv:2205.12401.
  20. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv:2304.11477.
  21. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. arXiv:2203.13474.
  22. Training language models to follow instructions with human feedback. arXiv:2203.02155.
  23. WARM: On the Benefits of Weight Averaged Reward Models. arXiv:2401.12187.
  24. Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  25. LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
  26. Attention Is All You Need. arXiv:1706.03762.
  27. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. arXiv:2306.01693.
  28. Uncertainty-Penalized Reinforcement Learning from Human Feedback with Diverse Reward LoRA Ensembles. arXiv:2401.00243.
  29. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
Authors (6)
  1. Shun Zhang
  2. Zhenfang Chen
  3. Sunli Chen
  4. Yikang Shen
  5. Zhiqing Sun
  6. Chuang Gan
Citations (17)