The Perfect Blend: Redefining RLHF with Mixture of Judges (2409.20370v1)

Published 30 Sep 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning LLMs (LLM). However, RLHF has limitations in multi-task learning (MTL) due to challenges of reward hacking and extreme multi-objective optimization (i.e., trade-off of multiple and/or sometimes conflicting objectives). Applying RLHF for MTL currently requires careful tuning of the weights for reward model and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we called Constrained Generative Policy Optimization (CGPO). The core of CGPO is Mixture of Judges (MoJ) with cost-efficient constrained policy optimization with stratification, which can identify the perfect blend in RLHF in a principled manner. It shows strong empirical results with theoretical guarantees, does not require extensive hyper-parameter tuning, and is plug-and-play in common post-training pipelines. Together, this can detect and mitigate reward hacking behaviors while reaching a pareto-optimal point across an extremely large number of objectives. Our empirical evaluations demonstrate that CGPO significantly outperforms standard RLHF algorithms like PPO and DPO across various tasks including general chat, STEM questions, instruction following, and coding. Specifically, CGPO shows improvements of 7.4% in AlpacaEval-2 (general chat), 12.5% in Arena-Hard (STEM & reasoning), and consistent gains in other domains like math and coding. Notably, PPO, while commonly used, is prone to severe reward hacking in popular coding benchmarks, which CGPO successfully addresses. This breakthrough in RLHF not only tackles reward hacking and extreme multi-objective optimization challenges but also advances the state-of-the-art in aligning general-purpose LLMs for diverse applications.

PDF HTML Abstract

The Perfect Blend: Redefining RLHF with Mixture of Judges

The paper entitled "The Perfect Blend: Redefining RLHF with Mixture of Judges" introduces a novel paradigm in fine-tuning LLMs using Reinforcement Learning from Human Feedback (RLHF). This paper addresses the inherent challenges in multi-task learning (MTL) associated with reward hacking and extreme multi-objective optimization. To this end, the authors propose a method termed Constrained Generative Policy Optimization (CGPO), which strategically utilizes a Mixture of Judges (MoJ) for cost-efficient constrained policy optimization with stratification.

Key Contributions

Constrained Policy Optimization

The essential innovation of CGPO lies in its use of multiple constraints to mitigate reward hacking. Various constraint satisfaction criteria are defined and validated using a blend of rule-based and LLM-based judges. The core CGPO framework is designed to identify constraint-violating patterns during policy optimization:

Calibrated Regularized Policy Gradient (CRPG): Uses a calibrated reward model to account for value discrepancies across prompts, ensuring consistency and enhancing optimization.
Constrained Online Direct Preference Optimization (CODPO): An online variant of DPO tailored for a constrained setting, maximizing the margin between positive and negative samples while adhering to constraints.
Calibrated Regularized Reward Ranking Finetuning (CRRAFT): Built upon RAFT, reweighs selected samples based on calibrated reward values for fine-tuned optimization.

Multi-Objective Optimization with MoJ

CGPO also incorporates a novel approach for managing diverse tasks:

Prompts are categorized into distinct, non-overlapping tasks.
Each task is associated with a customized policy optimization strategy, encompassing tailored MoJs, reward models, and specific hyperparameters.
By optimizing tasks independently, CGPO avoids the pitfalls of conflicting objective compromises inherent in linear combined reward approaches.

Empirical Results

The proposed CGPO framework demonstrated superior performance across a variety of benchmarks when compared to existing state-of-the-art RLHF algorithms like PPO and DPO. Notable empirical results from the paper include:

AlpacaEval-2 (General chat): Improvements over PPO by 7.4%.
Arena-Hard (STEM content reasoning): Gains of 12.5%.
HumanEval (Coding): Enhanced by 5%.
MATH and GSM8K (Math content reasoning): Increased by 2%.
ARC Challenge (Knowledge): Increased by 2%.

The results highlighted that PPO struggled with reward hacking, particularly in coding benchmarks, while CGPO successfully avoided such pitfalls.

Theoretical Foundations and Practical Implications

The theoretical underpinnings of CGPO come with guarantees for minimizing reward model imperfections and ensuring constraint satisfaction. The use of primal-type constrained RLHF optimizers instead of the traditional primal-dual approach eliminates the need for extensive hyperparameter tuning. This feature offers scalability and user-friendliness in large-scale LLM post-training settings.

On the practical front, CGPO stands out with its plug-and-play nature, seamless integration into existing post-training pipelines, and reduced requirement for extensive tuning. It is particularly well-suited for real-world applications where LLMs need to handle complex, multi-objective tasks.

Future Directions

The research presented in the paper opens up several avenues for future exploration:

Further refinement of MoJs to automate the constraint evaluation process to reduce human intervention.
Investigation into adaptive gradient weights in CGPO for more efficient handling of heterogeneous tasks and complex objectives.
Extension of the CGPO framework to dynamically evolving multi-task environments to improve real-time adaptability and response accuracy.

In summary, the paper provides a robust and innovative solution to the challenges of reward hacking and multi-objective optimization in RLHF for LLMs. Through the introduction of CGPO and the strategic use of MoJs, the authors have laid a solid theoretical and practical foundation for more effective and comprehensively aligned LLMs in diverse applications.