Overview of RRHF
Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent method for aligning LLMs with human preferences, as seen in systems such as InstructGPT. However, the algorithm typically used for this purpose, Proximal Policy Optimization (PPO), often faces scalability issues due to hyperparameter sensitivity and architectural complexity. This paper introduces a new learning paradigm, RRHF (Rank Responses to Align Language Models with Human Feedback), that seeks to alleviate these challenges. RRHF simplifies the alignment process by scoring model-generated responses and using a ranking loss to bring their ordering in line with human preferences.
RRHF vs PPO
RRHF distinguishes itself from PPO through its minimalist approach: it requires only one to two models during tuning, compared with PPO's four, because it dispenses with the separate value and reference models that PPO maintains. Responses are scored by their conditional log probabilities and optimized with a ranking loss, eliminating the need for auxiliary models or KL-divergence penalties. The robustness of RRHF is demonstrated on the Helpful and Harmless dataset, where it matches PPO in both automatic metrics and human evaluation, without the associated complexity.
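To make the mechanics concrete, below is a minimal PyTorch sketch of this kind of objective: each candidate response is scored by its length-normalized conditional log probability under the model, a pairwise hinge loss pushes the scores to follow the preference ranking, and a cross-entropy term on the best-ranked response preserves supervised fine-tuning behavior. The function name, tensor layout, and normalization details are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def rrhf_loss(logits, labels, rewards, pad_token_id):
    """Sketch of an RRHF-style objective for one prompt with k candidate responses.

    logits:  (k, seq_len, vocab_size) model logits, assumed already aligned with
             `labels` (the usual one-token causal shift is omitted for brevity)
    labels:  (k, seq_len) response token ids, with pad_token_id outside the response
    rewards: (k,) preference scores for the candidates (higher is better)
    """
    # Score p_i: length-normalized conditional log probability of each candidate.
    log_probs = F.log_softmax(logits, dim=-1)
    mask = (labels != pad_token_id).float()
    token_logp = log_probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1) * mask
    scores = token_logp.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)   # shape (k,)

    # Ranking loss: for every pair with reward_i < reward_j, penalize score_i > score_j.
    score_diff = scores.unsqueeze(1) - scores.unsqueeze(0)      # [i, j] = s_i - s_j
    reward_diff = rewards.unsqueeze(1) - rewards.unsqueeze(0)   # [i, j] = r_i - r_j
    rank_loss = torch.relu(score_diff)[reward_diff < 0].sum()

    # Fine-tuning (cross-entropy) loss on the highest-reward candidate.
    best = rewards.argmax()
    ft_loss = -token_logp[best].sum()

    return rank_loss + ft_loss
```

Because the score of each response comes from the policy's own token log probabilities, no value network, reference model, or KL term appears anywhere in the objective.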
Experimental Findings
Experiments with RRHF yielded several insights. The approach aligns LLMs efficiently, achieving performance comparable to PPO while requiring significantly fewer resources and less implementation complexity. The results also show that the quality of the responses sampled during training correlates directly with the quality of the tuned model, underscoring the importance of high-quality sampling in the alignment process (see the sketch below).
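To illustrate that point, here is a small, self-contained sketch of how a ranked candidate set could be assembled for one prompt: mix the policy's own samples, responses from other LLMs, and an optional human-written reference, then sort them by whatever preference signal is available. The names `collect_candidates`, `score_fn`, and `RankedCandidates` are hypothetical and not taken from the paper's codebase.

```python
from dataclasses import dataclass

@dataclass
class RankedCandidates:
    prompt: str
    responses: list[str]   # candidate responses, best first
    rewards: list[float]   # matching preference scores

def collect_candidates(prompt, policy_samples, other_llm_responses,
                       human_response, score_fn):
    """Gather candidate responses from several sources and rank them.

    score_fn is any preference signal (a trained reward model, heuristic,
    or human label) mapping (prompt, response) -> float; higher is better.
    """
    candidates = list(policy_samples) + list(other_llm_responses)
    if human_response is not None:
        candidates.append(human_response)

    scored = sorted(
        ((score_fn(prompt, r), r) for r in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    rewards, responses = zip(*scored)
    return RankedCandidates(prompt, list(responses), list(rewards))
```

Higher-quality sources in this mix give the ranking loss better targets, which is consistent with the finding that sampling quality drives the quality of the tuned model.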
Additionally, to simulate real-world training conditions for models like ChatGPT, a new LLM named Wombat was trained using RRHF. Wombat outperformed its SFT baseline and aligned effectively with human preferences when trained on prompts and responses gathered from other LLMs, showcasing the generalizability of RRHF.
Contributions and Future Work
The key contributions of this paper are the development of RRHF as a simplified and efficient training paradigm, the framing of RRHF as an extension of SFT, and the demonstration that it performs comparably to PPO on the Helpful and Harmless dataset. These contributions are meaningful because they could make it easier to scale LLM alignment to human preferences, especially when resources are limited.
In terms of future work, although RRHF has shown promise, the authors acknowledge limitations such as potential over-optimization and the need to process multiple candidate responses per prompt, which increases GPU memory usage. Exploring how to address these limitations will be crucial for further improving the method and for safely deploying LLMs aligned with human preferences.