Dynamic Programming-based Approximate Optimal Control for Model-Based Reinforcement Learning (2312.14463v1)
Abstract: This article proposes an improved trajectory optimization approach for stochastic optimal control of dynamical systems affected by measurement noise, combining optimal control with maximum likelihood techniques to better reduce the cumulative cost-to-go. A modified optimization objective function that incorporates dynamic programming-based controller design is presented to handle noise in the system and sensors. Empirical results demonstrate the effectiveness of the approach in reducing stochasticity and in providing an intermediate optimization-switching step that efficiently balances exploration and exploitation on complex tasks by constraining policy parameters to those obtained from this improved optimization. The study also includes theoretical work on the uniqueness of control parameter estimates and leverages a structure of the likelihood function with established theoretical guarantees. Furthermore, a theoretical result is explored that bridges the gap between the proposed optimization objective function and existing dualities between information theory (relative entropy) and optimal control.
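As a rough illustration of the dynamic programming-based controller design and cost-to-go reduction referenced in the abstract, the sketch below shows a standard finite-horizon LQR backward pass that computes a quadratic cost-to-go and linear feedback gains. This is a generic example under assumed linear-quadratic dynamics and cost (the function name `lqr_backward_pass` and the double-integrator setup are illustrative assumptions), not the paper's specific algorithm.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Q_T, horizon):
    """Finite-horizon discrete-time LQR backward pass (illustrative sketch).

    Computes feedback gains K_t and the cost-to-go matrix for
    x_{t+1} = A x_t + B u_t with stage cost x'Qx + u'Ru and
    terminal cost x'Q_T x.
    """
    P = Q_T                      # terminal cost-to-go matrix
    gains = []
    for _ in range(horizon):
        # Minimizing u'Ru + (Ax + Bu)' P (Ax + Bu) over u gives u = -K x
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati recursion propagating the quadratic cost-to-go backward in time
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    gains.reverse()              # gains[t] is the feedback gain at time step t
    return gains, P

# Example usage: double-integrator dynamics (assumed for illustration)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
gains, P0 = lqr_backward_pass(A, B, Q, R, Q_T=10 * np.eye(2), horizon=50)
```

In iterative schemes such as iLQG, a backward pass of this form is applied to local linear-quadratic approximations of the nonlinear stochastic problem; the paper's contribution of a modified, likelihood-based objective would replace or reshape the cost being propagated here.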