
A Survey of Temporal Credit Assignment in Deep Reinforcement Learning (2312.01072v2)

Published 2 Dec 2023 in cs.LG and cs.AI

Abstract: The Credit Assignment Problem (CAP) refers to the longstanding challenge of Reinforcement Learning (RL) agents to associate actions with their long-term consequences. Solving the CAP is a crucial step towards the successful deployment of RL in the real world since most decision problems provide feedback that is noisy, delayed, and with little or no information about the causes. These conditions make it hard to distinguish serendipitous outcomes from those caused by informed decision-making. However, the mathematical nature of credit and the CAP remains poorly understood and defined. In this survey, we review the state of the art of Temporal Credit Assignment (CA) in deep RL. We propose a unifying formalism for credit that enables equitable comparisons of state-of-the-art algorithms and improves our understanding of the trade-offs between the various methods. We cast the CAP as the problem of learning the influence of an action over an outcome from a finite amount of experience. We discuss the challenges posed by delayed effects, transpositions, and a lack of action influence, and analyse how existing methods aim to address them. Finally, we survey the protocols to evaluate a credit assignment method and suggest ways to diagnose the sources of struggle for different methods. Overall, this survey provides an overview of the field for new-entry practitioners and researchers, it offers a coherent perspective for scholars looking to expedite the starting stages of a new study on the CAP, and it suggests potential directions for future research.


Summary

  • The paper presents a unified framework that maps actions, contexts, and outcomes to quantify credit in deep RL.
  • The study categorizes TCA challenges by depth, density, and breadth, clarifying long-term impact, signal sparsity, and decision pathway diversity.
  • It emphasizes the need for tailored evaluation benchmarks and open-source frameworks to enhance reproducibility and guide future research.

Temporal Credit Assignment (TCA) is a fundamental concept in the field of reinforcement learning (RL), a branch of AI focused on how agents learn to make decisions by interacting with their environment. The Credit Assignment Problem (CAP) deals with the challenge of identifying which actions are responsible for particular outcomes—especially when rewards or feedback are delayed. Addressing the CAP effectively is vital for developing RL algorithms that can be deployed in real-world situations where decision-making consequences are often complex and not immediately apparent.

Recently, there has been a surge of research attempting to untangle the complexities of TCA within Deep Reinforcement Learning (Deep RL). In "A Survey of Temporal Credit Assignment in Deep Reinforcement Learning," the authors review the current state of understanding of how to attribute credit to actions effectively in RL. They aim to provide a unified perspective, identifying the principal challenges and suggesting directions for future research.

The survey casts TCA as the problem of approximating causal action influence from a finite amount of experience. To formalise this, the paper introduces "assignments": functions that map an action, a context (the past actions, the present circumstances, and the policy governing future actions), and an outcome (a goal) to a quantified measure of the action's influence. This enables a systematic comparison of different TCA methods and algorithms.
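
To make the formalism concrete, here is a minimal Python sketch of what such an assignment function could look like. It is an illustration under assumptions, not code from the survey: the names `Context`, `Assignment`, and `advantage_assignment` are invented for this summary, and the advantage Q(s, a) - V(s) is used only as one familiar quantity that fits the (action, context, outcome) to influence shape.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative placeholder types; the survey's formalism is abstract.
State = int
Action = int
Outcome = int  # e.g. a goal index or a target return


@dataclass
class Context:
    """What an assignment conditions on: past transitions, the current
    state, and the policy that will choose future actions."""
    history: Sequence[tuple[State, Action]]
    state: State
    policy: Callable[[State], Action]


# An "assignment": a function from (action, context, outcome) to a scalar
# measure of how much the action influences the outcome.
Assignment = Callable[[Action, Context, Outcome], float]


def advantage_assignment(q: Callable[[State, Action], float],
                         v: Callable[[State], float]) -> Assignment:
    """One concrete instance (hypothetical helper): score an action by its
    advantage Q(s, a) - V(s) in the current state, ignoring the outcome."""
    def assign(action: Action, context: Context, outcome: Outcome) -> float:
        return q(context.state, action) - v(context.state)
    return assign
```

Under this reading, different credit assignment methods correspond to different choices of this function and of how it is estimated from experience.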

One key aspect discussed in the paper is the identification of three primary dimensions of the CAP within Deep RL: depth, density, and breadth. Each dimension captures a distinct difficulty in assigning credit (a toy sketch after the list illustrates the first two):

  • Depth pertains to how actions can influence long-term outcomes.
  • Density addresses the influence strength of these actions over outcomes, often hindered by sparse reinforcement signals.
  • Breadth involves the variety of potential pathways or decisions that could lead to similar outcomes.
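
As a toy illustration of the first two dimensions, the sketch below (an assumption of this summary, not a benchmark taken from the survey) builds a chain-like episode in which only the very first action matters, yet the single informative reward arrives many steps later: credit must travel deep in time, and the reinforcement signal is maximally sparse.

```python
class DelayedChain:
    """Hypothetical toy episode generator: the first action alone decides
    the outcome, but the only non-zero reward appears at the final step."""

    def __init__(self, horizon: int = 50):
        self.horizon = horizon

    def episode(self, first_action: int) -> list[float]:
        rewards = [0.0] * (self.horizon - 1)               # long stretch of zero reward
        rewards.append(1.0 if first_action == 1 else 0.0)  # delayed, sparse outcome
        return rewards


env = DelayedChain(horizon=50)
print(sum(env.episode(first_action=1)))  # 1.0: all credit belongs to step 0
print(sum(env.episode(first_action=0)))  # 0.0
```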

The paper then categorizes various RL algorithms based on the mechanics they employ to allocate credit, such as temporal contiguity, return decomposition, and auxiliary goal conditioning. It also covers approaches that condition on future outcomes retrospectively ("hindsight methods") and those that model decisions as sequences or leverage planning techniques.
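
The oldest of these mechanisms, temporal contiguity, spreads credit to recent actions through temporal-difference errors and eligibility traces. The snippet below is a generic tabular TD(lambda) prediction update, included only as a reference point for that family and not taken from the survey; the state space, step size, and chain rollout are illustrative assumptions.

```python
import numpy as np


def td_lambda_update(values, trace, state, next_state, reward,
                     alpha=0.1, gamma=0.99, lam=0.9):
    """One step of tabular TD(lambda) with accumulating eligibility traces.

    The TD error is credited to recently visited states in proportion to
    their decayed eligibility: a simple, recency-based form of credit
    assignment.
    """
    delta = reward + gamma * values[next_state] - values[state]  # TD error
    trace *= gamma * lam          # decay all eligibilities
    trace[state] += 1.0           # mark the current state as eligible
    values += alpha * delta * trace
    return values, trace


# Usage on a tiny 5-state chain (illustrative only):
values = np.zeros(5)
trace = np.zeros(5)
for s in range(4):
    reward = 1.0 if s == 3 else 0.0
    values, trace = td_lambda_update(values, trace, s, s + 1, reward)
```

Return decomposition, hindsight conditioning, and sequence models can be read as different ways of replacing this purely recency-based rule with learned notions of influence.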

Finally, the survey examines methods for evaluating TCA implementations, stressing the need for metrics and protocols that do not merely apply the standards for RL control but are specifically tailored to assess the credit assignment aspect. It calls for new benchmarks that can isolate and directly evaluate CAP challenges without confounding factors like exploration strategies.

The survey also highlights remaining gaps in understanding and implementing TCA. Open questions include what constitutes optimal credit assignment, what role causality should play in designing effective TCA systems, and how to build benchmarks that precisely target CAP-related issues. The authors further call for open-source, accessible, and well-documented code to foster reproducibility, and suggest community-driven standards and shared databases of evaluation results as vital steps for future progress.

Overall, the survey contributes to the RL community by systematizing TCA concepts and challenges, reviewing the approaches proposed to address them, and identifying where further research is needed to advance the field.
