This paper (Zhang et al., 30 Jan 2024) addresses a critical issue in Reinforcement Learning from Human Feedback (RLHF): reward models trained on limited human preference data are often inaccurate. This inaccuracy can lead to "reward hacking" or "reward overoptimization," where the LLM generates outputs that receive high predicted rewards from the faulty reward model but are actually misaligned with true human preferences. To mitigate this, the paper proposes using an ensemble of reward models to obtain more accurate and robust reward predictions.
A key challenge in ensembling LLM-based reward models is the high computational and resource cost associated with training and deploying multiple large models. The paper focuses on developing efficient ensemble methods to overcome this limitation.
The paper proposes three architectural designs for reward model ensembles:
- Ensemble of Single Reward Models: This is a straightforward approach where k reward models are trained independently from different random initializations. Each model consists of a Transformer backbone and a linear layer. At inference, all models are loaded, and their predictions are ensembled. This provides strong diversity among models but is computationally expensive, requiring k full reward models to be trained and loaded.
- Implementation Note: This method is resource-intensive during training and inference due to loading multiple large models. The paper notes it was not feasible for their PPO experiments.
- Linear-layer Ensemble: To improve efficiency, this method uses a single shared Transformer model and trains k separate linear layers, each outputting a reward prediction (see the sketch after this list). All linear layers and the shared Transformer are trained concurrently. At inference, the single Transformer is loaded, and its final hidden state is fed into the k linear layers. This significantly reduces the number of parameters to train and load, requiring only one Transformer backbone plus k lightweight linear heads.
- Implementation Note: This approach is more memory-efficient for inference as it only loads one Transformer. Training is also potentially faster than training full models.
- LoRA-based Ensemble: This method aims to strike a balance between efficiency and model diversity. It also uses a shared Transformer backbone and trains k separate linear layers. Additionally, it adds and trains separate LoRA adapters (Hu et al., 2021) on the Transformer layers, one for each ensemble member. The LoRA adapters allow each ensemble member to introduce slight, low-rank modifications to the shared Transformer's weights.
- Implementation Note: LoRA adapters have significantly fewer parameters than the full Transformer weights. This method requires loading a single shared Transformer plus k LoRA adapters and k linear heads.
- Training Strategy: The paper describes a specific training procedure for LoRA-based ensemble. They first finetune the shared Transformer along with the linear layers using a subset of the preference data, similar to linear-layer ensemble. Then, they use the remaining data to train only the LoRA adapters and the linear layers for each ensemble member, keeping the main Transformer weights frozen. This pre-finetuning step is found to be necessary for the Transformer to be suitable for reward prediction.
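To make the shared-backbone designs concrete, below is a minimal PyTorch sketch of the linear-layer ensemble: one shared Transformer backbone feeding k independent scalar reward heads. The class name, the last-token pooling choice, and the use of Hugging Face `AutoModel` are illustrative assumptions rather than details from the paper; the LoRA-based variant would additionally attach one low-rank adapter per ensemble member to the backbone.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class LinearLayerEnsembleRM(nn.Module):
    """Illustrative linear-layer ensemble reward model: a single shared
    Transformer backbone with k independent scalar reward heads."""

    def __init__(self, backbone_name: str, k: int = 3):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        # One linear head per ensemble member, each mapping a pooled
        # hidden state to a scalar reward.
        self.heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(k)])

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Pool with the hidden state of the last non-padded token (assumes a
        # decoder-style backbone such as an SFT'd LLaMA checkpoint).
        last_token = attention_mask.sum(dim=1) - 1
        pooled = out.last_hidden_state[torch.arange(input_ids.size(0)), last_token]
        # Shape (batch, k): one reward prediction per ensemble member.
        return torch.cat([head(pooled) for head in self.heads], dim=-1)
```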
For ensembling the predictions from the reward models, the paper explores two methods:
- Mean Value Prediction: Calculate the average of the predicted reward values. This is expected to reduce prediction variance.
- Lower Confidence Bound (LCB): Calculate $\mathrm{mean}(\mathcal{R}) - \lambda \cdot \mathrm{std}(\mathcal{R})$, where $\mathcal{R} = \{r_1, \dots, r_k\}$ is the set of ensemble predictions and $\lambda \ge 0$ is a hyperparameter. This provides a more conservative estimate of the reward, which can be beneficial in offline settings or for mitigating overoptimization. Both aggregation rules are sketched right after this list.
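Both aggregation rules reduce to a few lines of code. In the sketch below, `preds` holds the k per-member predictions for each sample and `lam` stands in for the LCB hyperparameter $\lambda$; the function name and defaults are illustrative assumptions.

```python
import torch

def aggregate_rewards(preds: torch.Tensor, method: str = "mean", lam: float = 1.0) -> torch.Tensor:
    """Combine per-member reward predictions of shape (batch, k) into one
    scalar reward per sample."""
    if method == "mean":
        # Mean value prediction: simple averaging to reduce variance.
        return preds.mean(dim=-1)
    if method == "lcb":
        # Lower confidence bound: penalize disagreement among ensemble members.
        return preds.mean(dim=-1) - lam * preds.std(dim=-1)
    raise ValueError(f"unknown aggregation method: {method}")
```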
The paper evaluates these ensemble methods within the standard RLHF framework using two common approaches:
- Best-of-n: Generate n candidate responses for an instruction and select the one with the highest ensembled reward (a Best-of-n sketch follows this list).
- Proximal Policy Optimization (PPO): Use the ensembled reward model as the reward function to finetune the SFT model using the PPO algorithm.
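The sketch below shows how the ensemble slots into Best-of-n selection, reusing `aggregate_rewards` and the reward model class from the earlier sketches; the decoding settings and function signature are assumptions for illustration, not the paper's exact setup.

```python
import torch

@torch.no_grad()
def best_of_n(prompt: str, policy, tokenizer, reward_model, n: int = 16) -> str:
    """Sample n candidate responses and return the one with the highest
    ensembled reward (illustrative; decoding hyperparameters abbreviated)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = policy.generate(
        **inputs, do_sample=True, num_return_sequences=n, max_new_tokens=256
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    scores = []
    for text in candidates:
        enc = tokenizer(text, return_tensors="pt")
        preds = reward_model(enc["input_ids"], enc["attention_mask"])  # (1, k)
        scores.append(aggregate_rewards(preds, method="mean").item())
    return candidates[int(torch.tensor(scores).argmax())]
```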
Experiments are conducted using the AlpacaFarm and MT-Bench datasets and evaluation setups. The base LLM is Llama-7b finetuned via SFT (SFT10k). The reward models are also initialized from SFT10k. The ensemble size is set to 3.
Key experimental findings:
- Using reward model ensembles consistently improves alignment performance compared to using a single reward model, as measured by win rates on AlpacaEval and scores on MT-Bench. This confirms that ensembling helps mitigate the negative effects of single reward model inaccuracies (answering Q1).
- Among the efficient methods, LoRA-based ensemble generally achieves the best performance, closely matching or sometimes exceeding the performance of the full ensemble of single models (answering Q2). Linear-layer ensemble also improves performance but is slightly behind LoRA-based and full ensembles.
- Both Mean Value Prediction and Lower Confidence Bound (LCB) prediction methods yield similar performance in their experiments (answering Q3). This suggests that for their setup, simply averaging the predictions is sufficient to gain the benefits of ensembling.
Practical Implementation Considerations:
- Computational Trade-offs:
- Ensemble of Single Reward Models: Highest training time (k independent runs) and highest inference memory (load k full models). Best for scenarios with abundant resources or where maximum performance is critical regardless of cost.
- Linear-layer Ensemble: Reduced training parameters (shared transformer) and lowest inference memory (load 1 transformer + k small layers). Most efficient in terms of memory, good for deployment on limited hardware.
- LoRA-based Ensemble: Moderate training parameters (shared transformer + k adapters) and moderate inference memory (load 1 transformer + k adapters + k small layers). Offers a good balance of performance and efficiency.
- Dataset Management: For LoRA-based ensemble, splitting the training data for pre-finetuning the shared backbone and then training the adapters and linear layers is a practical detail crucial for its performance.
- Hyperparameters: The paper provides detailed hyperparameters used for reward modeling (with/without LoRA), PPO, and decoding, which are essential for reproducing their results and implementing the methods.
- Integration into RLHF pipelines: The ensembled reward prediction function replaces the single reward model prediction call within the standard Best-of-n selection or PPO optimization loop (see the wrapper sketch below). For PPO, the efficiency of the LoRA-based and linear-layer methods makes ensemble training feasible, unlike the full ensemble.
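For PPO, one convenient pattern (an assumption about the surrounding training code, not the paper's implementation) is to hide the ensemble behind a single reward-function callable, reusing `aggregate_rewards` from the earlier sketch, so the rest of the pipeline stays unchanged:

```python
import torch

def make_ensemble_reward_fn(reward_model, tokenizer, method: str = "mean", lam: float = 1.0):
    """Wrap the ensemble model and aggregation rule behind one callable that a
    PPO loop can call exactly like a single reward model."""
    @torch.no_grad()
    def reward_fn(texts: list[str]) -> torch.Tensor:
        # Assumes tokenizer.pad_token is set (e.g., reuse the EOS token).
        enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        preds = reward_model(enc["input_ids"], enc["attention_mask"])  # (batch, k)
        return aggregate_rewards(preds, method=method, lam=lam)        # (batch,)
    return reward_fn
```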
In summary, the paper demonstrates that using efficient reward model ensemble techniques, particularly the LoRA-based approach, is a practical and effective way to improve the alignment of LLMs in RLHF by producing more robust reward signals, without incurring the prohibitive costs of ensembling multiple full models.