Skill Disentanglement for Imitation Learning from Suboptimal Demonstrations (2306.07919v1)
Abstract: Imitation learning has achieved great success in many sequential decision-making tasks, in which a neural agent is trained by imitating collected human demonstrations. However, existing algorithms typically require a large number of high-quality demonstrations, which are difficult and expensive to collect; in practice, a trade-off usually has to be made between demonstration quality and quantity. Targeting this problem, in this work we consider imitation from sub-optimal demonstrations, given a small clean demonstration set together with a large noisy one. Some pioneering works address this setting, but they suffer from notable limitations: for example, they assume a demonstration to have the same optimality across all of its time steps, and they fail to provide any interpretation of the knowledge learned from the noisy set. Addressing these problems, we propose SDIL (Skill Disentanglement for Imitation Learning), which evaluates and imitates at the sub-demonstration level, encoding action primitives of varying quality into different skills. Concretely, SDIL consists of a high-level controller that discovers skills and a skill-conditioned module that captures action-taking policies; it is trained in a two-phase pipeline, first discovering skills from all demonstrations and then adapting the controller to the clean set only. A mutual-information-based regularization and a dynamic sub-demonstration optimality estimator are designed to promote disentanglement in the skill space. Extensive experiments over two Gym environments and a real-world healthcare dataset demonstrate the superiority of SDIL in learning from sub-optimal demonstrations and its improved interpretability, illustrated by examining the learned skills.
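To make the two-module design concrete, below is a minimal PyTorch sketch of a high-level skill controller paired with a skill-conditioned policy. The module names, layer sizes, Gumbel-softmax skill selection, and behavior-cloning loss are all illustrative assumptions for a discrete-action setting, not the authors' exact implementation; the mutual-information regularizer and the sub-demonstration optimality estimator are omitted.

```python
# Minimal sketch of the two-module architecture described in the abstract.
# All names, layer sizes, and the use of Gumbel-softmax are illustrative
# assumptions, not details taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkillController(nn.Module):
    """High-level controller: maps a state to a (relaxed) one-hot skill code."""

    def __init__(self, state_dim: int, num_skills: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Differentiable sample of a discrete skill (straight-through estimator).
        return F.gumbel_softmax(self.net(state), tau=tau, hard=True)


class SkillConditionedPolicy(nn.Module):
    """Low-level module: action logits conditioned on the selected skill."""

    def __init__(self, state_dim: int, num_skills: int, action_dim: int,
                 hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_skills, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor, skill: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, skill], dim=-1))


def bc_loss(controller, policy, states, actions):
    """Phase-1 behavior-cloning loss over all demonstrations (clean + noisy).

    `states`: float tensor (batch, state_dim); `actions`: long tensor (batch,)
    of discrete action indices. Phase 2 would reuse this loss on the clean set
    only, updating just the controller.
    """
    skills = controller(states)
    action_logits = policy(states, skills)
    return F.cross_entropy(action_logits, actions)
```

The `hard=True` straight-through Gumbel-softmax keeps the skill code discrete in the forward pass while still letting gradients reach the controller, one common way to realize discrete skill assignment; in the second phase, one would freeze `SkillConditionedPolicy` and fine-tune only `SkillController` on the clean set.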
Authors: Tianxiang Zhao, Wenchao Yu, Suhang Wang, Lu Wang, Xiang Zhang, Yuncong Chen, Yanchi Liu, Wei Cheng, Haifeng Chen