Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning (2010.08755v3)

Published 17 Oct 2020 in cs.LG, cs.CV, and cs.RO

Abstract: Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even disregarded entirely. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model the multimodality and stochasticity. We treat the environmental state-action transition as a conditional generative process, generating the next-state prediction conditioned on the current state, action, and a latent variable, which provides a better understanding of the dynamics and leads to better performance in exploration. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment model-based exploration approaches.

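The abstract frames the state-action transition as a conditional generative process and uses an upper bound on the negative log-likelihood of the transition (the negative ELBO of a conditional VAE) as the intrinsic reward. Below is a minimal PyTorch sketch of that idea for vector-valued states; the class and method names (ConditionalVariationalDynamics, intrinsic_reward) and the unit-variance Gaussian output are illustrative assumptions, not the authors' implementation, which operates on image observations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalVariationalDynamics(nn.Module):
    """Conditional-VAE dynamics model: encoder q(z | s, a, s'), conditional
    prior p(z | s, a), decoder p(s' | s, a, z). The per-transition negative
    ELBO upper-bounds -log p(s' | s, a) and can serve as an intrinsic reward."""

    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(            # q(z | s, a, s')
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-variance
        )
        self.prior = nn.Sequential(              # p(z | s, a)
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        self.decoder = nn.Sequential(            # p(s' | s, a, z), unit-variance Gaussian mean
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def intrinsic_reward(self, s, a, s_next):
        """Return the negative ELBO for each (s, a, s') transition in the batch."""
        mu_q, logvar_q = self.encoder(torch.cat([s, a, s_next], dim=-1)).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)

        # Reparameterized sample from the approximate posterior.
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)

        # Reconstruction term: -log p(s' | s, a, z) up to a constant (unit-variance Gaussian).
        s_pred = self.decoder(torch.cat([s, a, z], dim=-1))
        recon = 0.5 * F.mse_loss(s_pred, s_next, reduction="none").sum(dim=-1)

        # KL divergence between diagonal Gaussians: KL(q(z | s, a, s') || p(z | s, a)).
        kl = 0.5 * (
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0
        ).sum(dim=-1)

        return recon + kl  # upper bound on -log p(s' | s, a)
```

Under these assumptions, training the model by minimizing the same bound on collected transitions and handing the per-transition bound to a policy-gradient learner (e.g. PPO) as its reward would give a self-supervised exploration loop in the spirit the abstract describes: the reward stays high exactly where the learned conditional dynamics still assign the observed transition low likelihood.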