Actor-Critic or Critic-Actor? A Tale of Two Time Scales (2210.04470v6)

Published 10 Oct 2022 in cs.LG

Abstract: We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation, with the value function computed on the faster time scale and the policy on the slower one. This emulates policy iteration. We observe that reversing the two time scales in fact emulates value iteration and yields a legitimate algorithm. We provide a proof of convergence and compare the two empirically, with and without function approximation (using both linear and nonlinear function approximators), and observe that our proposed critic-actor algorithm performs on par with actor-critic in terms of both accuracy and computational effort.
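
To make the two time-scale structure concrete, here is a minimal sketch (Python/NumPy) of both variants on a small randomly generated tabular MDP. This is not the paper's exact algorithm: the random MDP, the Boltzmann (softmax) policy parameterization, and the step-size schedules a(n) = n^(-0.6) and b(n) = n^(-1) are illustrative assumptions. The only difference between the two runs is which recursion receives the faster step size: the critic (standard actor-critic, emulating policy iteration) or the actor (the critic-actor variant, emulating value iteration).

```python
# Minimal two time-scale sketch on a random tabular MDP (illustrative, not the
# paper's exact algorithm). With fast="critic" the value recursion uses the
# faster step size (standard actor-critic); with fast="actor" the policy
# recursion is faster (the critic-actor variant).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
R = rng.uniform(0.0, 1.0, size=(S, A))       # reward R[s, a]

def simulate(fast="critic", n_steps=200_000):
    V = np.zeros(S)                          # tabular value estimates (critic)
    theta = np.zeros((S, A))                 # Boltzmann policy parameters (actor)
    s = 0
    for n in range(1, n_steps + 1):
        # Two step-size schedules with a(n)/b(n) -> infinity, so the recursion
        # using a(n) runs on the faster time scale.
        a_n = 1.0 / (n ** 0.6)
        b_n = 1.0 / n
        alpha_V, alpha_pi = (a_n, b_n) if fast == "critic" else (b_n, a_n)

        # Sample an action from the softmax policy at the current state.
        logits = theta[s] - theta[s].max()
        pi_s = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(A, p=pi_s)
        s_next = rng.choice(S, p=P[s, a])

        # The temporal-difference error drives both recursions.
        delta = R[s, a] + gamma * V[s_next] - V[s]
        V[s] += alpha_V * delta              # critic (value) update
        theta[s, a] += alpha_pi * delta      # actor (policy) update
        s = s_next
    return V, theta

V_ac, _ = simulate(fast="critic")  # actor-critic: critic on the faster time scale
V_ca, _ = simulate(fast="actor")   # critic-actor: actor on the faster time scale
print("actor-critic V :", np.round(V_ac, 3))
print("critic-actor V :", np.round(V_ca, 3))
```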
