- The paper introduces B-Pref as the first benchmark for preference-based RL, addressing the challenges of handcrafted reward functions.
- It simulates realistic human feedback, including noise and irrational behaviors, to evaluate algorithms across diverse tasks.
- Initial results show that algorithms such as PEBBLE perform best across most tasks, particularly robot manipulation, demonstrating the benchmark's utility for systematic performance assessment.
Benchmarking Preference-Based Reinforcement Learning: An Overview of B-Pref
The paper introduces B-Pref, a benchmark tailored specifically for preference-based reinforcement learning (RL) algorithms. Preference-based RL is a promising alternative to traditional RL, which requires a well-specified reward function that is often difficult to define in complex environments. By leveraging teacher-provided preferences as feedback, it promises to mitigate problems associated with reward engineering and reward exploitation. However, the lack of an established benchmark has historically made it difficult to evaluate and compare preference-based algorithms systematically. B-Pref addresses this gap with a standardized benchmarking framework that includes dedicated evaluation metrics and simulated human teachers exhibiting a range of irrationalities.
Key Contributions and Findings
1. Problem Context and Justification:
Reinforcement learning's reliance on hand-crafted reward functions poses significant obstacles, especially in complex tasks such as autonomous driving or robotic manipulation. RL agents can also inadvertently exploit poorly specified reward functions, leading to unintended behavior. Preference-based RL instead learns from user preferences rather than a predefined reward, reducing the risk of reward hacking and keeping the system better aligned with the user's intent, as sketched below.
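To make the mechanism concrete, here is a minimal sketch of how a reward model can be fit to pairwise segment preferences with a Bradley-Terry-style cross-entropy loss, the standard formulation used by the algorithms B-Pref evaluates. The module and function names are illustrative, not part of the benchmark's API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Small MLP mapping (state, action) pairs to scalar reward estimates."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg_0, seg_1, labels):
    """Bradley-Terry cross-entropy loss on a batch of segment pairs.

    seg_0, seg_1: tuples of (obs, act) tensors shaped (batch, segment_len, dim).
    labels: (batch,) floats; 1.0 if segment 1 is preferred, 0.0 if segment 0 is,
            0.5 when the teacher marks the pair as equally preferable.
    """
    # Sum predicted per-step rewards over each segment.
    ret_0 = reward_model(*seg_0).sum(dim=1)
    ret_1 = reward_model(*seg_1).sum(dim=1)
    # P[segment 1 preferred] under the Bradley-Terry model.
    logits = ret_1 - ret_0
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```

The learned reward model then replaces the hand-crafted reward when training the policy, so the agent optimizes behavior the teacher actually prefers.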
2. B-Pref Benchmark Characteristics:
B-Pref comprises tasks from the DeepMind Control Suite and Meta-world, covering both locomotion and robotic manipulation challenges. The paper emphasizes simulating human input realistically by modeling several behavioral irrationalities, such as noise in the preference labels, myopic (recency-weighted) judgments, occasional outright mistakes, and the tendency to skip a query or mark two segments as equally preferable.
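As an illustration, the following is a simplified simulated teacher in the spirit of the paper's model, combining a discounted (myopic) view of each segment, a stochastic Bradley-Terry choice, random mistakes, and skip/equal responses. Parameter names and the exact formulation are assumptions made here for illustration, not the benchmark's implementation.

```python
import numpy as np

def simulated_teacher(rewards_0, rewards_1, beta=1.0, gamma=0.9,
                      eps_mistake=0.1, skip_thresh=None,
                      equal_thresh=None, rng=None):
    """Label a pair of segments the way an imperfect human might (sketch).

    rewards_0, rewards_1: per-step ground-truth rewards of the two segments.
    Returns 0 (prefer segment 0), 1 (prefer segment 1),
    0.5 (equally preferable), or None (query skipped).
    """
    rng = rng if rng is not None else np.random.default_rng()

    # Myopic teacher: later steps in a segment weigh more (discount gamma).
    weights = gamma ** np.arange(len(rewards_0))[::-1]
    ret_0 = float(np.sum(weights * np.asarray(rewards_0)))
    ret_1 = float(np.sum(weights * np.asarray(rewards_1)))

    # Skip queries where neither segment looks informative.
    if skip_thresh is not None and max(ret_0, ret_1) < skip_thresh:
        return None
    # Mark near-ties as "equally preferable".
    if equal_thresh is not None and abs(ret_0 - ret_1) < equal_thresh:
        return 0.5

    # Stochastic (Bradley-Terry) choice; beta controls rationality.
    p_prefer_1 = 1.0 / (1.0 + np.exp(-beta * (ret_1 - ret_0)))
    label = int(rng.random() < p_prefer_1)

    # With small probability the teacher gives the opposite answer.
    if rng.random() < eps_mistake:
        label = 1 - label
    return label
```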
3. Evaluation Mechanism:
B-Pref provides a platform to evaluate the feedback efficiency and robustness of preference-based RL algorithms. Performance is reported relative to standard RL agents trained on the ground-truth reward, enabling a straightforward comparison of algorithms under realistic, imperfect human feedback. The benchmark supports algorithms such as PEBBLE and PrefPPO, giving designers robust evidence on which to base algorithmic decisions.
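One plausible way to compute such a normalized score (the exact definition in the benchmark may differ) is to measure the preference-based learner's ground-truth return relative to the same backbone algorithm trained directly on the ground-truth reward:

```python
import numpy as np

def normalized_score(pref_returns, oracle_returns, random_return=0.0):
    """Score a preference-based learner against a ground-truth-reward agent.

    pref_returns:   episode returns of the preference-based learner
                    (e.g., PEBBLE or PrefPPO), measured with the
                    ground-truth task reward.
    oracle_returns: returns of the same backbone RL algorithm (e.g., SAC
                    or PPO) trained directly on the ground-truth reward.
    random_return:  return of a random policy, used as the lower baseline.
    """
    pref = np.mean(pref_returns)
    oracle = np.mean(oracle_returns)
    return (pref - random_return) / (oracle - random_return)
```

A score near 1 indicates the preference-based learner matches the agent that observes the true reward directly, while lower scores quantify the cost of learning from preferences alone.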
4. Research Insights:
Initial results using B-Pref reveal distinctive insights: PEBBLE outperforms the other tested algorithms in most environments, especially tasks involving robot manipulation. However, performance degrades markedly when a fraction of the preference labels is incorrect, pointing to robustness as an area ripe for further exploration and innovation.
5. Future Scope and Considerations:
The research underscores shortcomings of existing methods, highlighting the need for algorithms that better tolerate mislabeled preferences and use feedback more efficiently. Future work could explore techniques such as active learning and meta-learning within this framework. Scaling to visual inputs, handling sparse-reward settings, and generalizing learned reward functions across environments also pose intriguing challenges for practitioners in the field.
Conclusion
B-Pref offers a comprehensive benchmark and evaluation suite aimed at advancing the efficacy and robustness of preference-based RL. The framework supports a deeper understanding of current limitations and can guide future research toward highly effective preference-based RL systems. As the field evolves, benchmarks of this kind will be crucial for developing RL systems capable of tackling increasingly complex real-world tasks.