Deep Reinforcement Learning from Hierarchical Preference Design (2309.02632v3)

Published 6 Sep 2023 in cs.LG and cs.AI

Abstract: Reward design is a fundamental yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scales and intricate dependencies of those signals. This paper shows that by exploiting certain structures, one can ease the reward design process. Specifically, we propose a hierarchical reward modeling framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning. Both scenarios allow us to design a hierarchical decision tree, induced by the importance ranking of the feedback signals, to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications and find that our framework can not only train high-performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness. Our code is available at https://github.com/abukharin3/HERON.
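
In concrete terms, the hierarchical comparison can be read as a decision tree over the ranked feedback signals: two trajectories are compared on the most important signal first, and only if that signal does not separate them does the comparison fall through to the next level. The sketch below illustrates this idea; it is not the authors' implementation (see the linked repository for that), and the signal names, the per-trajectory scalar summaries, and the margin-based tie-breaking rule are assumptions made for illustration.

```python
# Minimal sketch of a hierarchical trajectory comparison, assuming each
# trajectory is summarized by per-signal scalar feedback and ties are broken
# by a fixed margin. Not the HERON implementation; names are illustrative.
from typing import Dict, List, Optional

def hierarchical_compare(
    traj_a: Dict[str, float],
    traj_b: Dict[str, float],
    ranked_signals: List[str],
    margin: float = 0.1,
) -> Optional[int]:
    """Walk the feedback hierarchy from most to least important signal.

    Returns +1 if traj_a is preferred, -1 if traj_b is preferred, or None
    when no signal separates the trajectories by more than `margin`.
    """
    for signal in ranked_signals:
        diff = traj_a[signal] - traj_b[signal]
        if abs(diff) > margin:           # first decisive level wins
            return 1 if diff > 0 else -1
    return None                          # tie at every level of the tree

# Hypothetical usage: rank safety above throughput above efficiency.
ranked = ["collisions_avoided", "throughput", "fuel_efficiency"]
a = {"collisions_avoided": 1.0, "throughput": 0.4, "fuel_efficiency": 0.7}
b = {"collisions_avoided": 1.0, "throughput": 0.9, "fuel_efficiency": 0.2}
print(hierarchical_compare(a, b, ranked))  # -1: decided at the throughput level
```

The resulting ±1 labels play the role of the preference data mentioned in the abstract; a reward model can then be fit to them with a standard pairwise preference loss (e.g., Bradley-Terry style) and used for policy learning.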

Authors (4)
  1. Alexander Bukharin (16 papers)
  2. Yixiao Li (14 papers)
  3. Pengcheng He (60 papers)
  4. Tuo Zhao (131 papers)