Barrier Functions Inspired Reward Shaping for Reinforcement Learning (2403.01410v2)

Published 3 Mar 2024 in cs.RO

Abstract: Reinforcement Learning (RL) has progressed from simple control tasks to complex real-world challenges with large state spaces. While RL excels at these tasks, training time remains a limitation. Reward shaping is a popular solution, but existing methods often rely on value functions, which face scalability issues. This paper presents a novel safety-oriented reward-shaping framework inspired by barrier functions, offering simplicity and ease of implementation across various environments and tasks. To evaluate the effectiveness of the proposed reward formulations, we conduct simulation experiments on the CartPole, Ant, and Humanoid environments, along with real-world deployment on the Unitree Go1 quadruped robot. Our results show that the method converges 1.4-2.8 times faster and requires as little as 50-60% of the actuation effort of the vanilla reward. In a sim-to-real experiment with the Go1 robot, the proposed reward framework also yields better control and dynamics of the robot.
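To make the idea concrete, the sketch below adds a barrier-function-style shaping term to a per-step environment reward: the penalty stays small while a bounded state variable (here the CartPole pole angle, with the standard Gymnasium termination threshold of about 12 degrees) remains well inside its safe interval, and grows sharply as it approaches either boundary. The inverse-barrier form, the function names, and the weight are illustrative assumptions, not the paper's exact formulation.

```python
def barrier_shaping(x, x_min, x_max, scale=1.0, eps=1e-6):
    """Inverse-barrier penalty for a scalar state x constrained to [x_min, x_max].

    Near zero deep inside the interval, sharply negative near either bound.
    (Generic barrier-inspired form; an assumption, not the paper's definition.)
    """
    d_lower = max(x - x_min, eps)   # distance to the lower bound
    d_upper = max(x_max - x, eps)   # distance to the upper bound
    return -scale * (1.0 / d_lower + 1.0 / d_upper)


def shaped_reward(env_reward, theta, theta_limit=0.2095, weight=1e-3):
    """Augment the environment reward with a barrier-inspired penalty on
    the pole angle theta. theta_limit matches the standard CartPole
    termination threshold (~12 deg); `weight` is a hypothetical tuning
    coefficient chosen for illustration."""
    return env_reward + weight * barrier_shaping(theta, -theta_limit, theta_limit)


# Example usage for a single CartPole step (theta taken from the observation):
# r = shaped_reward(env_reward=1.0, theta=obs[2])
```

In this form the shaped reward discourages the policy from approaching the safety boundary long before termination occurs, which is the intuition behind using barrier-like terms to speed up convergence and reduce aggressive actuation.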

Authors (6)
  1. Abhishek Ranjan (4 papers)
  2. Shreenabh Agrawal (2 papers)
  3. Aayush Jain (10 papers)
  4. Pushpak Jagtap (49 papers)
  5. Shishir Kolathaya (42 papers)
  6. Nilaksh Nilaksh (2 papers)
Citations (4)