
Reward-Punishment Reinforcement Learning with Maximum Entropy (2405.11784v1)

Published 20 May 2024 in cs.LG, cs.AI, and cs.RO

Abstract: We introduce the "soft Deep MaxPain" (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives. Our motivation is to facilitate a smoother variation of the operators used to update action values beyond the traditional "max" and "min" operators, with the goal of enhancing sample efficiency and robustness. We also address two unresolved issues from the previous Deep MaxPain method. First, we investigate how the negated ("flipped") pain-seeking sub-policy, derived from the punishment action value, collaborates with the "min" operator to effectively learn the punishment module, and how softDMP's smooth learning operator provides insight into the "flipping" trick. Second, we tackle the challenge of data collection for learning the punishment module, mitigating inconsistencies arising from the involvement of the "flipped" sub-policy (the pain-avoidance sub-policy) in the unified behavior policy. We empirically explore the first issue in two discrete Markov Decision Process (MDP) environments, elucidating the crucial advances of the DMP approach and the necessity of soft treatments of the hard operators. For the second issue, we propose a probabilistic classifier based on the ratio of the pain-seeking sub-policy to the sum of the pain-seeking and goal-reaching sub-policies. This classifier assigns roll-outs to separate replay buffers for updating the reward and punishment action-value functions, respectively. Our framework demonstrates superior performance in Turtlebot 3 maze-navigation tasks in the ROS Gazebo simulation.
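
To make the mechanisms described in the abstract concrete, here is a minimal Python sketch (not the authors' softDMP implementation), assuming Boltzmann (softmax) sub-policies with a temperature tau. It shows a temperature-scaled log-sum-exp as the smooth replacement for the hard "max" and "min" operators in the reward and punishment value updates, and a probabilistic classifier that routes experience to a reward or punishment replay buffer according to the ratio of the pain-seeking sub-policy to the sum of the pain-seeking and goal-reaching sub-policies. The function names (soft_max, soft_min, boltzmann, assign_buffer) and the per-transition routing are illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' softDMP code.
import numpy as np


def soft_max(q: np.ndarray, tau: float) -> float:
    """Smooth stand-in for max_a Q(s,a): tau * log sum_a exp(Q(s,a)/tau)."""
    m = q.max()
    return m + tau * np.log(np.exp((q - m) / tau).sum())  # numerically stable


def soft_min(q: np.ndarray, tau: float) -> float:
    """Smooth stand-in for min_a Q(s,a), as used for the punishment value update."""
    return -soft_max(-q, tau)


def boltzmann(q: np.ndarray, tau: float) -> np.ndarray:
    """Softmax sub-policy over actions induced by action values q."""
    z = (q - q.max()) / tau
    p = np.exp(z)
    return p / p.sum()


def assign_buffer(q_reward: np.ndarray, q_pain: np.ndarray,
                  action: int, tau: float, rng: np.random.Generator) -> str:
    """Route one transition to a replay buffer.

    Score = pi_pain(a|s) / (pi_pain(a|s) + pi_goal(a|s)), i.e. the ratio of the
    pain-seeking sub-policy to the sum of the pain-seeking and goal-reaching
    sub-policies.  Per-transition routing is an assumption here; the paper
    assigns whole roll-outs.
    """
    pi_goal = boltzmann(q_reward, tau)[action]
    pi_pain = boltzmann(q_pain, tau)[action]
    p_pain = pi_pain / (pi_pain + pi_goal)
    return "punishment_buffer" if rng.random() < p_pain else "reward_buffer"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_r = np.array([1.0, 0.2, -0.5])   # reward action values at some state
    q_p = np.array([-0.1, 0.8, 0.3])   # punishment action values at that state
    print(soft_max(q_r, tau=0.5), soft_min(q_p, tau=0.5))
    print(assign_buffer(q_r, q_p, action=1, tau=0.5, rng=rng))
```

As tau approaches zero, soft_max and soft_min collapse to the hard max and min of the original Deep MaxPain update, which is the "smoother variation of operators" the abstract refers to.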

References (25)
  1. S. Elfwing and B. Seymour. Parallel reward and punishment control in humans and robots: safe reinforcement learning using the maxpain algorithm. In Proc. of the 7th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics, pages 140–147, 2017.
  2. Differential encoding of losses and gains in the human striatum. Journal of Neuroscience, 27(18):4826–4831, 2007.
  3. Serotonin selectively modulates reward value in human decision-making. Journal of Neuroscience, 32(17):5833–5842, 2012.
  4. Striatal structure and function predict individual biases in learning to avoid pain. Proceedings of the National Academy of Sciences of the United States of America, 113(17):4812–4817, 2016.
  5. Deep reinforcement learning by parallelizing reward and punishment using the maxpain architecture. In Proc. of the 8th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics. IEEE, 2018.
  6. Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Networks, 135:115–126, 2021.
  7. K. Asadi and M. L. Littman. An alternative softmax operator for reinforcement learning. In Proc. of the 34th International Conference on Machine Learning, 2017.
  8. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
  9. S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3-4):323–339, 1992.
  10. J. Karlsson. Learning to solve multiple goals. PhD thesis, University of Rochester, 1997.
  11. Hybrid reward architecture for reinforcement learning. In Advances in Neural Information Processing Systems 30, 2017.
  12. M. Humphrys. Action selection methods using reinforcement learning. In From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, pages 135–144, 1996.
  13. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.
  14. Two dimensional evaluation reinforcement learning. In International Work-Conference on Artificial Neural Networks, pages 370–377. Springer, 2001.
  15. R. Lowe and T. Ziemke. Exploring the relationship of reward and punishment in reinforcement learning. In Proc. of the 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 140–147. IEEE, 2013.
  16. Reward-punishment actor-critic algorithm applying to robotic non-grasping manipulation. In 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pages 37–42. IEEE, 2019.
  17. A story of two streams: Reinforcement learning models from human behavior and neuropsychiatry. In Proc. of the 19th International Conference on Autonomous Agents and Multi-Agent Systems, pages 744–752, 2020.
  18. Taming the noise in reinforcement learning via soft updates. In Proc. of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016.
  19. Dynamic policy programming. Journal of Machine Learning Research, 13:3207–3245, 2012.
  20. M. Toussaint. Robot trajectory optimization using approximate inference. In Proc. of the 26th International Conference on Machine Learning, pages 1049–1056, 2009.
  21. Reinforcement learning with deep energy-based policies. In Proc. of the 34th International Conference on Machine Learning, pages 1352–1361, 2017.
  22. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proc. of the 35th International Conference on Machine Learning, pages 1861–1870, 2018.
  23. S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.
  24. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In Proc. of the 22nd International Conference on Artificial Intelligence and Statistics, pages 2995–3003, 2019.
  25. B. Eysenbach and S. Levine. Maximum entropy RL (provably) solves some robust RL problems. In Proc. of the 10th International Conference on Learning Representations, 2022.
