Group Robust Preference Optimization in Reward-Free RLHF: A Detailed Analysis
The paper "Group Robust Preference Optimization in Reward-Free RLHF" addresses a significant limitation in existing methods for adapting LLMs with reinforcement learning from human feedback (RLHF). Specifically, it tackles the "one-size-fits-all" approach that overlooks the preferences of diverse user groups, leading to poor alignment of LLMs for certain subgroups.
Core Contribution
The primary contribution of the paper is a novel Group Robust Preference Optimization (GRPO) method that addresses preference heterogeneity across user groups. The proposed method builds upon direct preference optimization (DPO) techniques but distinguishes itself by optimizing for worst-case group performance. This is realized by adaptively and sequentially weighting the importance of different groups, thereby robustly aligning the LLM to varied preferences.
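To make this concrete, the following is a minimal formalization of the min-max principle the paper pursues, assuming the per-group loss is the standard DPO loss; the notation (group weights $\alpha$ on the simplex $\Delta_K$, group datasets $\mathcal{D}_g$, reference policy $\pi_{\mathrm{ref}}$, temperature $\beta$) follows common DPO conventions rather than being copied verbatim from the paper.

```latex
\min_{\theta}\; \max_{\alpha \in \Delta_K}\; \sum_{g=1}^{K} \alpha_g \, \mathcal{L}_g(\pi_\theta),
\qquad
\mathcal{L}_g(\pi_\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}_g}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Maximizing over $\alpha$ concentrates weight on the group with the largest current loss, which is precisely the worst-case focus described above.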
Methodology
The authors developed a method that adapts the context of the LLM by including group information and optimizes the worst-case alignment performance across all groups. GRPO follows a reward-free framework similar to previous direct preference optimization methods but introduces a robustness component to handle diverse preference distributions.
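Since the paper's exact prompt template is not reproduced here, the snippet below is only an illustrative Python sketch of how group information might be injected into the LLM's context; the bracketed `[group: ...]` format and the example group identifiers are assumptions.

```python
def add_group_context(prompt: str, group_id: str) -> str:
    """Prepend a group identifier so the policy can condition on the group.

    The bracketed template is illustrative; the paper only states that
    group information is included in the LLM's context.
    """
    return f"[group: {group_id}] {prompt}"


# The same question conditioned on two hypothetical groups.
print(add_group_context("Should public transport be free?", "country_A"))
print(add_group_context("Should public transport be free?", "country_B"))
```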
Key steps in the proposed methodology include:
- Problem Formulation: The optimization problem is defined to minimize the worst-case cumulative loss across multiple groups.
- Algorithm: An alternating optimization algorithm is proposed in which the group weights and the policy parameters are updated iteratively, with the group weights adjusted via a mirror-descent step (a sketch follows this list).
- Theoretical Analysis: The paper examines the convergence properties of the GRPO algorithm for log-linear policy classes, providing rigorous mathematical proofs.
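Below is a minimal PyTorch sketch of one such alternating update, assuming each group supplies a batch of policy and reference log-probability ratios (log π(y_w|x) − log π(y_l|x)); the function names, interface, and step sizes are illustrative rather than the authors' implementation.

```python
import torch


def dpo_loss(policy_logratios, ref_logratios, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    return -torch.nn.functional.logsigmoid(beta * (policy_logratios - ref_logratios))


def grpo_step(group_batches, alpha, optimizer, eta=0.01, beta=0.1):
    """One alternating GRPO-style update (illustrative interface).

    group_batches[g] is a (policy_logratios, ref_logratios) pair for group g, where
    policy_logratios must carry gradients back to the policy parameters held by `optimizer`.
    """
    # 1) Loss of each group under the current policy.
    group_losses = torch.stack([
        dpo_loss(pol, ref, beta).mean() for pol, ref in group_batches
    ])

    # 2) Mirror ascent (exponentiated gradient) on the probability simplex:
    #    groups with higher loss receive exponentially more weight.
    with torch.no_grad():
        alpha = alpha * torch.exp(eta * group_losses)
        alpha = alpha / alpha.sum()

    # 3) Gradient step on the alpha-weighted loss, updating only the policy.
    optimizer.zero_grad()
    torch.dot(alpha, group_losses).backward()
    optimizer.step()
    return alpha, group_losses.detach()


# Toy usage: random log-ratios stand in for a real policy/reference pair.
K = 3
alpha = torch.full((K,), 1.0 / K)
params = [torch.randn(8, requires_grad=True)]          # stand-in policy parameters
optimizer = torch.optim.SGD(params, lr=1e-3)
batches = [(params[0][:4] + torch.randn(4), torch.randn(4)) for _ in range(K)]
alpha, losses = grpo_step(batches, alpha, optimizer)
```

The multiplicative update in step 2 is the standard mirror-descent step on the simplex with entropic regularization: it shifts weight toward the currently worst-off group before the policy takes its gradient step on the weighted loss.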
Theoretical and Empirical Findings
The theoretical analysis shows that the GRPO algorithm converges and controls the worst-case group loss. Empirical evaluations on both synthetic and real-world datasets show that GRPO outperforms traditional, non-robust baselines.
Synthetic Data Results
In the synthetic experiments, various scenarios were tested, such as groups with the same response distributions but different sizes, or groups with different response distributions. GRPO consistently outperformed baseline methods, achieving lower worst-case validation losses and reward errors. Particularly in scenarios with imbalanced group sizes and differing responses, the GRPO method showed significant robustness and improved group-level performance.
Real-World Data Results
In the real-world experiments involving GlobalOpinionQA, GRPO was applied to align the Gemma-2B model to the preferences of survey participants from different countries. GRPO markedly reduced loss imbalances among groups and improved performance for the worst-off groups, achieving higher accuracy in preference alignment. This outcome illustrates the practical viability of GRPO in addressing diverse global opinions.
Implications and Future Directions
The research has several implications:
- Fairness: By ensuring equitable performance across all groups, the proposed method addresses potential biases in LLM fine-tuning.
- Customization: GRPO provides a framework for task-specific customization of LLMs catering to diverse demographic preferences without disadvantaging any particular subset.
- Scalability: The adaptive weighting mechanism is scalable and can be integrated into existing RLHF pipelines efficiently.
Future research may involve applying GRPO in other domains, such as vision and speech processing, where similar preference heterogeneity arises. Investigating the trade-off between worst-case and average-case performance to balance overall model efficacy is another significant avenue. Additionally, extending the theoretical framework to more complex neural architectures and larger-scale datasets would further validate GRPO's robustness.
Conclusion
The GRPO method represents a substantive advancement in the field of reward-free RLHF by addressing the critical issue of alignment with diverse user group preferences. The theoretical foundation and empirical validation provided in this paper underscore the potential of GRPO to enhance the fairness and effectiveness of LLM fine-tuning. Researchers and practitioners can build upon this work to develop more inclusive AI systems that better serve the needs of varied user populations.