Hybrid Inverse Reinforcement Learning (2402.08848v2)
Abstract: The inverse reinforcement learning approach to imitation learning is a double-edged sword. On the one hand, it can enable learning from a smaller number of expert demonstrations with more robustness to error compounding than behavioral cloning approaches. On the other hand, it requires that the learner repeatedly solve a computationally expensive reinforcement learning (RL) problem. Often, much of this computation is wasted searching over policies very dissimilar to the expert's. In this work, we propose using hybrid RL -- training on a mixture of online and expert data -- to curtail unnecessary exploration. Intuitively, the expert data focuses the learner on good states during training, which reduces the amount of exploration required to compute a strong policy. Notably, such an approach doesn't need the ability to reset the learner to arbitrary states in the environment, a requirement of prior work in efficient inverse RL. More formally, we derive a reduction from inverse RL to expert-competitive RL (rather than globally optimal RL) that allows us to dramatically reduce interaction during the inner policy search loop while maintaining the benefits of the IRL approach. This allows us to derive both model-free and model-based hybrid inverse RL algorithms with strong policy performance guarantees. Empirically, we find that our approaches are significantly more sample efficient than standard inverse RL and several other baselines on a suite of continuous control tasks.
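To make the training loop concrete, below is a minimal Python sketch of a hybrid inverse RL iteration on a hypothetical toy tabular MDP. The simple moment-matching reward update and the tabular Q-learning inner loop are illustrative stand-ins for the paper's reward player and expert-competitive RL subroutine (which carry the formal guarantees); every name, constant, and the MDP itself are invented for this example.

```python
# Minimal sketch (assumptions noted above): hybrid inverse RL on a toy tabular MDP.
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: 5 states, 2 actions, random dynamics, horizon 20 (all assumed).
S, A, H = 5, 2, 20
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
expert_policy = rng.integers(0, A, size=S)   # stand-in for the expert's policy

def rollout(policy_fn, n_episodes=10):
    """Collect (s, a, s_next) transitions by running policy_fn in the toy MDP."""
    data = []
    for _ in range(n_episodes):
        s = int(rng.integers(S))
        for _ in range(H):
            a = int(policy_fn(s))
            s_next = int(rng.choice(S, p=P[s, a]))
            data.append((s, a, s_next))
            s = s_next
    return data

# Expert demonstrations, collected once up front.
expert_data = rollout(lambda s: expert_policy[s], n_episodes=20)

reward = np.zeros((S, A))   # learned reward
Q = np.zeros((S, A))        # learner's action-value estimates
gamma, lr = 0.95, 0.1

def learner_policy(s, eps=0.1):
    """Epsilon-greedy policy from the current Q estimates."""
    return rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))

for _ in range(50):
    # Reward player: raise reward on expert state-actions, lower it on the
    # learner's (a crude gradient step on E_expert[r] - E_learner[r]).
    online_data = rollout(learner_policy, n_episodes=10)
    for s, a, _ in expert_data:
        reward[s, a] += lr
    for s, a, _ in online_data:
        reward[s, a] -= lr
    np.clip(reward, -1.0, 1.0, out=reward)   # keep the reward class bounded

    # Hybrid RL policy step: Q-learning on a 50/50 mixture of expert and
    # fresh online transitions, keeping updates focused on expert-visited states.
    idx = rng.integers(len(expert_data), size=len(online_data))
    mixture = [expert_data[i] for i in idx] + online_data
    for s, a, s_next in mixture:
        target = reward[s, a] + gamma * Q[s_next].max()
        Q[s, a] += lr * (target - Q[s, a])

print("learner greedy actions:", np.argmax(Q, axis=1))
print("expert actions:        ", expert_policy)
```

The choice that mirrors the hybrid RL idea is the 50/50 mixture of expert and freshly collected online transitions in the inner policy update: the expert half keeps value updates concentrated on states the expert actually visits, so the learner needs far less exploration and never requires resets to arbitrary states.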
Authors: Juntao Ren, Gokul Swamy, Zhiwei Steven Wu, J. Andrew Bagnell, Sanjiban Choudhury