Model-Based Reinforcement Learning via Stochastic Hybrid Models (2111.06211v3)
Abstract: Optimal control of general nonlinear systems is a central challenge in automation. Enabled by powerful function approximators, data-driven approaches to control have recently tackled challenging applications with considerable success. However, such methods often obscure the structure of dynamics and control behind black-box, over-parameterized representations, limiting our ability to understand closed-loop behavior. This paper adopts a hybrid-systems view of nonlinear modeling and control that lends an explicit hierarchical structure to the problem and breaks complex dynamics down into simpler, localized units. We consider a sequence modeling paradigm that captures the temporal structure of the data and derive an expectation-maximization (EM) algorithm that automatically decomposes nonlinear dynamics into stochastic piecewise affine models with nonlinear transition boundaries. Furthermore, we show that these time-series models naturally admit a closed-loop extension that we use to extract local polynomial feedback controllers from nonlinear experts via behavioral cloning. Finally, we introduce a novel hybrid relative entropy policy search (Hb-REPS) technique that incorporates the hierarchical nature of hybrid models and optimizes a set of time-invariant piecewise feedback controllers derived from a piecewise polynomial approximation of a global state-value function.
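To make the modeling idea concrete, the sketch below shows one way a stochastic piecewise affine system with state-dependent transition boundaries could be written down: each discrete mode carries its own affine dynamics, and the next mode is drawn from a softmax over affine scores of the current state, which yields nonlinear (logistic) boundaries between regimes. This is a minimal illustration under assumed dimensions and parameter names, not the authors' implementation; in the paper these parameters would be learned from data via EM rather than drawn at random as done here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; all names and dimensions here are assumptions.
K, dx, du = 3, 2, 1  # number of modes, state dim, control dim

# Per-mode affine dynamics: x' = A[k] x + B[k] u + c[k] + Gaussian noise.
A = np.stack([np.eye(dx) + 0.1 * rng.standard_normal((dx, dx)) for _ in range(K)])
B = 0.1 * rng.standard_normal((K, dx, du))
c = 0.1 * rng.standard_normal((K, dx))
Sigma = 0.01 * np.eye(dx)

# Mode transitions conditioned on the continuous state: a softmax over
# affine scores gives nonlinear (logistic) boundaries between regimes.
W = rng.standard_normal((K, K, dx))  # W[z] maps state x to logits over next mode
b = rng.standard_normal((K, K))

def next_mode(z, x):
    """Sample the next discrete mode given current mode z and state x."""
    logits = W[z] @ x + b[z]
    p = np.exp(logits - logits.max())
    return rng.choice(K, p=p / p.sum())

def rollout(x0, controls):
    """Sample a trajectory from the hybrid generative model."""
    x, z, traj = x0, 0, []
    for u in controls:
        z = next_mode(z, x)
        x = A[z] @ x + B[z] @ u + c[z] + rng.multivariate_normal(np.zeros(dx), Sigma)
        traj.append((z, x.copy()))
    return traj

traj = rollout(np.zeros(dx), 0.1 * rng.standard_normal((50, du)))
```

Because both the per-mode dynamics and the gating are smooth in their parameters, a model of this form admits the forward-backward E-step and closed-form affine M-step updates that an EM treatment of switching systems relies on.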