
Topological Experience Replay (2203.15845v3)

Published 29 Mar 2022 in cs.LG and cs.AI

Abstract: State-of-the-art deep Q-learning methods update Q-values using state transition tuples sampled from the experience replay buffer. These tuples are typically sampled uniformly at random, or prioritized according to measures such as the temporal difference (TD) error. Such sampling strategies can be inefficient at learning the Q-function because a state's Q-value depends on the Q-values of its successor states. If the sampling strategy ignores the precision of the Q-value estimate of the next state, it can lead to useless and often incorrect updates to the Q-values. To mitigate this issue, we organize the agent's experience into a graph that explicitly tracks the dependencies between the Q-values of states. Each edge in the graph represents a transition between two states via a single action. We perform value backups via a breadth-first search that expands vertices in the graph starting from the set of terminal states and successively moves backward. We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of goal-reaching tasks. Notably, the proposed method also outperforms baselines that consume more batches of training experience, and it operates on high-dimensional observations such as images.
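The backward backup order is the core idea of the abstract. Below is a minimal, illustrative Python sketch of that idea under simplifying assumptions: a tabular Q-table stands in for the paper's neural Q-function, and all names (e.g. `reverse_bfs_backup`) are hypothetical rather than the authors' implementation.

```python
from collections import defaultdict, deque

def reverse_bfs_backup(transitions, q, alpha=0.5, gamma=0.99):
    """Apply tabular Q-backups over stored transitions in reverse
    breadth-first order, starting from terminal states.

    transitions: iterable of (state, action, reward, next_state, done)
    q: dict mapping (state, action) -> float, updated in place
    """
    # Index the transition graph: incoming edges per state, plus the
    # set of actions observed at each state.
    predecessors = defaultdict(list)   # next_state -> incoming edges
    actions_at = defaultdict(set)      # state -> observed actions
    terminal_states = set()
    for s, a, r, s2, done in transitions:
        predecessors[s2].append((s, a, r, done))
        actions_at[s].add(a)
        if done:
            terminal_states.add(s2)

    # Breadth-first search backward from terminal states, so a state's
    # Q-values are refreshed only after those of its successors.
    queue = deque(terminal_states)
    visited = set(terminal_states)
    while queue:
        s2 = queue.popleft()
        # Bootstrap from the current estimate at s2 (0.0 if none).
        v2 = max((q.get((s2, b), 0.0) for b in actions_at[s2]), default=0.0)
        for s, a, r, done in predecessors[s2]:
            target = r if done else r + gamma * v2
            q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * target
            if s not in visited:
                visited.add(s)
                queue.append(s)
    return q

# Toy usage: a chain s0 -> s1 -> goal with reward 1 at the goal.
transitions = [("s0", "right", 0.0, "s1", False),
               ("s1", "right", 1.0, "goal", True)]
q = reverse_bfs_backup(transitions, {}, alpha=1.0)
# q[("s1", "right")] == 1.0; q[("s0", "right")] == 0.99 (gamma * 1.0)
```

Because the backup at a state waits until its successors have been dequeued, each target bootstraps from an already-refreshed estimate, which is the data-efficiency argument the abstract makes against uniform or TD-error-prioritized sampling.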
