
STARC: A General Framework For Quantifying Differences Between Reward Functions (2309.15257v3)

Published 26 Sep 2023 in cs.LG and cs.AI

Abstract: In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm will, with high probability, learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to anticipate in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms easier and more principled.
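To make the "standardise, then compare" idea concrete, here is a minimal sketch of one STARC-style distance in a tabular setting. It assumes rewards of the form R[s, a, s'], known transition probabilities P[s, a, s'], a fixed (here uniform) policy used only for canonicalisation, an unweighted L2 norm, and the L2 metric; the paper's framework permits other choices of canonicalisation function, norm, and metric, so this is an illustrative instance rather than the paper's exact construction. The canonicalisation step adds the potential term gamma*V(s') - V(s), where V is the value function of the reward itself under the fixed policy, which cancels any potential shaping; the normalisation step removes positive rescaling.

```python
import numpy as np

def value_function(R, P, pi, gamma):
    """State values V^pi for reward R[s,a,s'] under transitions P[s,a,s'],
    policy pi[s,a], and discount gamma, via the linear Bellman equations."""
    n_states = R.shape[0]
    r_pi = np.einsum('sa,sat,sat->s', pi, P, R)   # expected one-step reward
    P_pi = np.einsum('sa,sat->st', pi, P)         # state-to-state transitions under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def canonicalise(R, P, pi, gamma):
    """Cancel potential shaping by adding gamma*V(s') - V(s), where V is the
    value function of R itself under the fixed policy pi."""
    V = value_function(R, P, pi, gamma)
    return R + gamma * V[None, None, :] - V[:, None, None]

def starc_distance(R1, R2, P, pi, gamma):
    """A STARC-style pseudometric: canonicalise, normalise, then compare."""
    c1 = canonicalise(R1, P, pi, gamma)
    c2 = canonicalise(R2, P, pi, gamma)
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    s1 = c1 / n1 if n1 > 0 else c1   # a trivial (everywhere-shaped) reward maps to zero
    s2 = c2 / n2 if n2 > 0 else c2
    return np.linalg.norm(s1 - s2)

# Sanity check: the distance should be (near) zero for rewards that differ
# only by potential shaping and positive rescaling, since neither changes
# the ordering of policies.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
pi = np.full((S, A), 1.0 / A)
R = rng.standard_normal((S, A, S))
phi = rng.standard_normal(S)
R_shaped = 2.0 * (R + gamma * phi[None, None, :] - phi[:, None, None])
print(starc_distance(R, R_shaped, P, pi, gamma))  # approximately 0
```

The two standardisation steps target exactly the transformations that leave optimal behaviour unchanged: potential shaping is absorbed by the value-based canonicalisation, and positive scaling by the normalisation, so the resulting distance reflects only differences that can actually affect which policies are preferred.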
