Multi-Task Reward Learning from Human Ratings: Advancements in Reinforcement Learning from Human Feedback
In reinforcement learning from human feedback (RLHF), aligning AI model behavior with human expectations is a central objective. Traditional approaches typically treat reward learning as an isolated task, such as classification or regression, which can oversimplify the complexity of human decision-making. The paper "Multi-Task Reward Learning from Human Ratings" proposes an approach to RLHF that aims to more closely reflect the multifaceted nature of human judgment by integrating classification and regression objectives in a unified framework. It leverages human ratings in environments without predefined rewards to infer a reward function in a more nuanced and adaptive manner.
Key Contributions and Methodology
- Unified Framework for Reward Learning: The paper introduces a multi-task approach that uses human ratings to train a reward prediction model. The model dynamically balances classification and regression objectives, capturing both the discrete and scalar aspects of human feedback. Learnable weights reflecting the uncertainty between the two tasks allow the framework to adaptively emphasize one objective or the other as training progresses (see the first sketch after this list).
- Novel Reward Mapping: A significant contribution of this work is the transformation of discrete human ratings into continuous reward signals using a logarithmic mapping strategy. This mapping improves the granularity and differentiation of reward signals, allowing for more effective policy updates than traditional classification-only methods, which often fail to account for the ordinal relationships between rating classes (see the second sketch after this list).
- Empirical Evaluation Across Diverse Environments: The efficacy of the proposed method is validated through extensive experiments in six diverse DeepMind Control environments, ranging from relatively simple tasks like Cartpole to complex ones like Quadruped. The results indicate that the proposed method not only outperforms existing rating-based RL methods but also exceeds standard PPO performance under certain configurations.
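To make the adaptive weighting idea concrete, below is a minimal sketch (not the authors' released code) of a reward model with a classification head over rating classes and a regression head for a scalar reward, whose losses are balanced by learnable log-variance weights in the spirit of uncertainty-based multi-task weighting. The network sizes, head structure, and exact loss form are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskRewardModel(nn.Module):
    """Illustrative reward model: a shared trunk feeds a classification head
    (discrete rating class) and a regression head (scalar reward target).
    Per-task log-variances act as learnable, adaptive task weights."""

    def __init__(self, obs_dim: int, act_dim: int, n_ratings: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.cls_head = nn.Linear(hidden, n_ratings)  # logits over rating classes
        self.reg_head = nn.Linear(hidden, 1)          # scalar reward estimate
        # Learnable log-variances; assumed form, initialized to zero.
        self.log_var_cls = nn.Parameter(torch.zeros(()))
        self.log_var_reg = nn.Parameter(torch.zeros(()))

    def forward(self, obs, act):
        h = self.trunk(torch.cat([obs, act], dim=-1))
        return self.cls_head(h), self.reg_head(h).squeeze(-1)

    def loss(self, obs, act, rating, reward_target):
        logits, reward_pred = self(obs, act)
        cls_loss = F.cross_entropy(logits, rating)          # discrete rating task
        reg_loss = F.mse_loss(reward_pred, reward_target)   # continuous reward task
        # Each task loss is scaled by its exponentiated negative log-variance,
        # with the log-variances added back as a regularizer so neither weight
        # collapses to zero.
        return (torch.exp(-self.log_var_cls) * cls_loss + self.log_var_cls
                + torch.exp(-self.log_var_reg) * reg_loss + self.log_var_reg)
```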
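The paper's exact rating-to-reward mapping is not reproduced here; the snippet below is a hypothetical logarithmic mapping that illustrates the general idea of converting discrete rating classes into non-uniformly spaced continuous targets while preserving their ordinal structure.

```python
import numpy as np

def ratings_to_rewards(ratings, n_classes: int) -> np.ndarray:
    """Map discrete rating classes {0, ..., n_classes-1} to continuous
    rewards in [0, 1] with logarithmic spacing: adjacent low ratings are
    separated more sharply than adjacent high ratings. This is one
    illustrative choice, not necessarily the mapping used in the paper."""
    ratings = np.asarray(ratings, dtype=np.float64)
    return np.log1p(ratings) / np.log(n_classes)  # log(1 + k) / log(K)

# Example: five rating classes produce unevenly spaced reward targets.
print(ratings_to_rewards(np.arange(5), n_classes=5))
# -> approximately [0.  0.43  0.68  0.86  1. ]
```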
Implications and Speculations on Future Developments
The implications of this research are noteworthy, as it provides a robust method for incorporating human feedback into RL systems, potentially enhancing the applicability of RLHF in real-world scenarios such as robotics, healthcare, and autonomous systems. By creating a framework that more accurately reflects human decision-making, the proposed approach can lead to safer and more reliable AI systems.
Future developments might explore incorporating this framework into interactive or real-time RL settings, where human feedback dynamically influences agent behavior. Moreover, extending this approach to integrate other forms of human input, such as verbal feedback or gestures, could offer additional pathways for refining agent training.
Conclusion
The work presented in "Multi-Task Reward Learning from Human Ratings" advances our understanding of how human ratings can be used more effectively in RLHF. By bridging multiple learning tasks with adaptive weighting, the paper opens new avenues for developing RL systems that align more closely with human judgment and preferences, offering promising strategies for improving AI alignment in complex environments.