Sequence Compression Speeds Up Credit Assignment in Reinforcement Learning (2405.03878v2)

Published 6 May 2024 in cs.LG and cs.AI

Abstract: Temporal credit assignment in reinforcement learning is challenging due to delayed and stochastic outcomes. Monte Carlo targets can bridge long delays between action and consequence but lead to high-variance targets due to stochasticity. Temporal difference (TD) learning uses bootstrapping to overcome variance but introduces a bias that can only be corrected through many iterations. TD($\lambda$) provides a mechanism to navigate this bias-variance tradeoff smoothly. Appropriately selecting $\lambda$ can significantly improve performance. Here, we propose Chunked-TD, which uses predicted probabilities of transitions from a model for computing $\lambda$-return targets. Unlike other model-based solutions to credit assignment, Chunked-TD is less vulnerable to model inaccuracies. Our approach is motivated by the principle of history compression and 'chunks' trajectories for conventional TD learning. Chunking with learned world models compresses near-deterministic regions of the environment-policy interaction to speed up credit assignment while still bootstrapping when necessary. We propose algorithms that can be implemented online and show that they solve some problems much faster than conventional TD($\lambda$).
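
The abstract describes the Chunked-TD targets only in words. Below is a minimal sketch of the idea, under the assumption (not spelled out above) that the per-step $\lambda$ is set to a world model's predicted probability of the next state that was actually observed, so near-deterministic transitions are "chunked" through while unlikely transitions fall back on bootstrapping. The function name `chunked_lambda_returns` and its arguments are illustrative, not the paper's actual algorithm or API.

```python
import numpy as np

def chunked_lambda_returns(rewards, values, trans_probs, gamma=0.99):
    """Sketch of lambda-return targets with a per-step lambda taken from a
    world model's predicted probability of the observed transition (assumed
    chunking signal, see lead-in).

    rewards[t]     : reward received after step t                  (length T)
    values[t]      : value estimate V(s_{t+1}) for bootstrapping   (length T)
    trans_probs[t] : model probability of the observed s_{t+1}     (length T)

    Uses the standard recursion with time-varying lambda:
        G_t = r_t + gamma * ((1 - lam_{t+1}) * V(s_{t+1}) + lam_{t+1} * G_{t+1})
    """
    T = len(rewards)
    targets = np.empty(T)
    g = rewards[-1]            # assume the trajectory terminates, so V(s_T) = 0
    targets[-1] = g
    for t in reversed(range(T - 1)):
        lam = trans_probs[t]   # high prob -> propagate return; low prob -> bootstrap
        g = rewards[t] + gamma * ((1.0 - lam) * values[t] + lam * g)
        targets[t] = g
    return targets

# Toy trajectory: deterministic everywhere except step 2
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values = np.array([0.5, 0.5, 0.5, 0.0])
trans_probs = np.array([1.0, 1.0, 0.3, 1.0])
print(chunked_lambda_returns(rewards, values, trans_probs, gamma=1.0))
```

In this toy example the two near-deterministic steps pass the return straight through without bootstrapping, while the stochastic step (model probability 0.3) leans mostly on the bootstrapped value estimate; this is the compression-versus-bootstrapping behaviour the abstract describes.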

Authors (4)
  1. Aditya A. Ramesh
  2. Kenny Young
  3. Louis Kirsch
  4. Jürgen Schmidhuber
Citations (1)
