Actor-Critic or Critic-Actor? A Tale of Two Time Scales (2210.04470v6)
Published 10 Oct 2022 in cs.LG
Abstract: We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation, with the value function computed on a faster time scale and the policy computed on a slower time scale. This emulates policy iteration. We observe that reversing the time scales in fact emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two empirically, with and without function approximation (using both linear and nonlinear function approximators), and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
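To make the two time-scale structure concrete, below is a minimal sketch in Python of a tabular actor-critic loop on a toy MDP. The toy MDP, the step-size exponents, and the softmax policy parameterization are illustrative assumptions, not the paper's exact construction; the point is only that the critic and actor iterates use different step-size schedules, and that swapping the two schedules (so the policy moves on the faster time scale and the value function on the slower one) gives the critic-actor variant the abstract describes.

```python
import numpy as np

# Minimal sketch (assumptions, not the paper's exact algorithm): tabular
# two time-scale actor-critic on a randomly generated 2-state, 2-action MDP.
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
R = rng.random((nS, nA))                        # R[s, a] = expected reward

V = np.zeros(nS)            # critic: state values
theta = np.zeros((nS, nA))  # actor: softmax policy preferences

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

# Step-size schedules (illustrative exponents): a(n)/b(n) -> 0, so the iterate
# driven by a(n) evolves on the slower time scale. Here the actor uses a(n)
# and the critic uses b(n), i.e. the standard actor-critic arrangement.
# Exchanging the two schedules makes the actor the faster iterate, which is
# the critic-actor ordering studied in the paper.
def a(n): return 1.0 / (1 + n) ** 0.6   # slow time scale (actor)
def b(n): return 1.0 / (1 + n) ** 0.51  # fast time scale (critic)

s = 0
for n in range(1, 50_000):
    pi = policy(s)
    act = rng.choice(nA, p=pi)
    s_next = rng.choice(nS, p=P[s, act])
    r = R[s, act]

    delta = r + gamma * V[s_next] - V[s]   # TD error
    V[s] += b(n) * delta                   # critic update (fast time scale)

    grad = -pi
    grad[act] += 1.0                       # d log pi(act|s) / d theta[s, :]
    theta[s] += a(n) * delta * grad        # actor update (slow time scale)
    s = s_next

print("Estimated values:", V)
print("Greedy actions:", theta.argmax(axis=1))
```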