The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards (2401.05710v3)

Published 11 Jan 2024 in cs.LG

Abstract: The reward signal plays a central role in defining the desired behaviors of agents in reinforcement learning (RL). Rewards collected from realistic environments can be perturbed, corrupted, or noisy due to an adversary, sensor error, or because they come from subjective human feedback. It is therefore important to construct agents that can learn under such rewards. Existing methodologies for this problem make strong assumptions, including that the perturbation is known in advance, that clean rewards are accessible, or that the perturbation preserves the optimal policy. We study a new, more general class of unknown perturbations and introduce a distributional reward critic framework for estimating reward distributions and perturbations during training. Our proposed methods are compatible with any RL algorithm. Despite their increased generality, we show that they achieve comparable or better rewards than existing methods in a variety of environments, including those with clean rewards. Under the challenging and generalized perturbations we study, we win/tie the highest return in 44/48 tested settings (compared to 11/48 for the best baseline). Our results broaden and deepen our ability to perform RL in reward-perturbed environments.
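As a rough illustration of the idea described in the abstract, the sketch below shows what a distributional reward critic could look like in PyTorch: a small network that, for each state-action pair, predicts a categorical distribution over discretized reward bins, is trained with cross-entropy against the observed (possibly perturbed) rewards, and returns the mode of that distribution as the corrected reward estimate fed to the downstream RL algorithm. The bin count, network architecture, reward range, and the mode-based estimate are illustrative assumptions for this sketch, not details taken from the paper.

# Minimal sketch of a distributional reward critic (PyTorch). Assumptions:
# rewards are discretized into fixed bins, the critic is a small MLP, and the
# mode of the predicted distribution is used as the corrected reward.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistributionalRewardCritic(nn.Module):
    def __init__(self, state_dim, action_dim, num_bins=51, r_min=-1.0, r_max=1.0):
        super().__init__()
        self.num_bins = num_bins
        # Fixed bin centers spanning the assumed reward range.
        self.register_buffer("bin_centers", torch.linspace(r_min, r_max, num_bins))
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_bins),
        )

    def forward(self, state, action):
        # Logits over discretized reward bins for each (state, action) pair.
        return self.net(torch.cat([state, action], dim=-1))

    def loss(self, state, action, observed_reward):
        # Cross-entropy against the bin containing the observed (possibly
        # perturbed) reward, so the critic models the observed reward distribution.
        logits = self.forward(state, action)
        target_bin = torch.bucketize(observed_reward, self.bin_centers)
        target_bin = target_bin.clamp(max=self.num_bins - 1)
        return F.cross_entropy(logits, target_bin)

    @torch.no_grad()
    def estimate_reward(self, state, action):
        # Use the mode of the predicted distribution as the corrected reward;
        # this assumes the perturbation leaves the true reward as the most
        # likely observation (an assumption of this sketch).
        logits = self.forward(state, action)
        return self.bin_centers[logits.argmax(dim=-1)]


# Example usage: train the critic on perturbed rewards from a batch of
# transitions, then pass its estimates to any RL algorithm in place of the
# raw rewards.
if __name__ == "__main__":
    critic = DistributionalRewardCritic(state_dim=4, action_dim=2)
    opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    s, a = torch.randn(32, 4), torch.randn(32, 2)
    noisy_r = torch.rand(32) * 2 - 1  # stand-in for perturbed rewards
    opt.zero_grad()
    critic.loss(s, a, noisy_r).backward()
    opt.step()
    corrected = critic.estimate_reward(s, a)  # use in downstream RL updates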

