A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations

Published 29 Feb 2024 in cs.LG (arXiv:2402.18836v1)

Abstract: This paper investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weights of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.
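
The abstract names the ingredients of the method without spelling out the loss. As a rough, hypothetical sketch (not the authors' code), the PyTorch-style snippet below shows how a maximum-entropy actor loss and an observation-only behavioral-cloning term computed through a learned forward dynamics model could be combined. The critic `q_fn`, the `dynamics` model, and the fixed weights `alpha` and `lambda_bc` are illustrative placeholders; in the paper, the weights of the loss components are adjusted automatically, which is not reproduced here.

```python
# Hypothetical sketch (assumptions, not the paper's implementation): an
# augmented actor loss combining a SAC-style maximum-entropy objective with a
# behavioral-cloning term driven by a forward dynamics model f(s, a) -> s'.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPolicy(nn.Module):
    """Tanh-squashed Gaussian policy (illustrative)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.net(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        pre_tanh = dist.rsample()                 # reparameterized sample
        action = torch.tanh(pre_tanh)
        # Log-probability with the tanh change-of-variables correction.
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)


def augmented_actor_loss(policy, q_fn, dynamics, rl_states,
                         expert_states, expert_next_states,
                         alpha=0.2, lambda_bc=1.0):
    """Max-entropy RL term plus a BC term that uses only expert observations.

    Expert actions are unavailable, so the policy's action on an expert state
    is scored by how close the dynamics model's predicted next state is to the
    expert's observed next state. `alpha` and `lambda_bc` stand in for the
    weights that the paper tunes automatically.
    """
    # Maximum-entropy RL objective (SAC-style): maximize Q plus entropy.
    actions, log_probs = policy(rl_states)
    rl_loss = (alpha * log_probs - q_fn(rl_states, actions)).mean()

    # Observation-only behavioral cloning through the forward dynamics model:
    # push f(s_expert, pi(s_expert)) toward the expert's next observation.
    bc_actions, _ = policy(expert_states)
    predicted_next = dynamics(expert_states, bc_actions)
    bc_loss = F.mse_loss(predicted_next, expert_next_states)

    return rl_loss + lambda_bc * bc_loss
```

Routing the cloning term through the dynamics model is what lets expert data consisting only of state observations supervise the policy: the policy's chosen action is judged by how well the predicted next state matches the expert's observed next state.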
