
Skill or Luck? Return Decomposition via Advantage Functions (2402.12874v1)

Published 20 Feb 2024 in cs.LG

Abstract: Learning from off-policy data is essential for sample-efficient reinforcement learning. In the present work, we build on the insight that the advantage function can be understood as the causal effect of an action on the return, and show that this allows us to decompose the return of a trajectory into parts caused by the agent's actions (skill) and parts outside of the agent's control (luck). Furthermore, this decomposition enables us to naturally extend Direct Advantage Estimation (DAE) to off-policy settings (Off-policy DAE). The resulting method can learn from off-policy trajectories without relying on importance sampling techniques or truncating off-policy actions. We draw connections between Off-policy DAE and previous methods to demonstrate how it can speed up learning and when the proposed off-policy corrections are important. Finally, we use the MinAtar environments to illustrate how ignoring off-policy corrections can lead to suboptimal policy optimization performance.
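To make the skill/luck split concrete, one way to write such a decomposition (a minimal sketch under an undiscounted, finite-horizon assumption with V(s_T) = 0; the notation Q, V, A is assumed here rather than taken verbatim from the abstract) is the telescoping identity

G \;=\; \sum_{t=0}^{T-1} r_t \;=\; V(s_0) \;+\; \underbrace{\sum_{t=0}^{T-1} \bigl( Q(s_t, a_t) - V(s_t) \bigr)}_{\text{skill (advantages of chosen actions)}} \;+\; \underbrace{\sum_{t=0}^{T-1} \bigl( r_t + V(s_{t+1}) - Q(s_t, a_t) \bigr)}_{\text{luck (transition randomness)}} .

Each luck term has zero mean when conditioned on (s_t, a_t), so it isolates randomness the agent does not control, while each skill term is exactly the advantage A(s_t, a_t) of the action the agent chose.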

References (41)
  1. Planning in stochastic environments with a learned model. In International Conference on Learning Representations, 2021.
  2. Leemon C Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 4, pp.  2448–2453. IEEE, 1994.
  3. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.
  4. Disentangling causal effects for hierarchical reinforcement learning. arXiv preprint arXiv:2010.01351, 2020.
  5. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  6. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp. 1407–1416. PMLR, 2018.
  7. Game theory. MIT press, 1991.
  8. DeepMDP: Learning continuous latent space models for representation learning. In International Conference on Machine Learning, pp. 2170–2179. PMLR, 2019.
  9. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(9), 2004.
  10. The value equivalence principle for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33:5541–5552, 2020.
  11. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. arXiv preprint arXiv:1704.04651, 2017.
  12. Mastering atari with discrete world models. In International Conference on Learning Representations, 2020.
  13. Understanding multi-step deep reinforcement learning: A systematic study of the DQN target. arXiv preprint arXiv:1901.07510, 2019.
  14. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  15. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  16. Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.
  17. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  18. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  19. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
  20. Counterfactual credit assignment in model-free reinforcement learning. In International Conference on Machine Learning, pp. 7654–7664. PMLR, 2021.
  21. Marvin Minsky. Steps toward artificial intelligence. Proceedings of the IRE, 49(1):8–30, 1961.
  22. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.
  23. Safe and efficient off-policy reinforcement learning. Advances in neural information processing systems, 29, 2016.
  24. Direct advantage estimation. Advances in Neural Information Processing Systems, 35:11869–11880, 2022.
  25. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, pp.  759–766, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1558607072.
  26. Off-policy temporal-difference learning with function approximation. In ICML, pp.  417–424, 2001.
  27. Adaptive trade-offs in off-policy learning. In International Conference on Artificial Intelligence and Statistics, pp.  34–44. PMLR, 2020.
  28. The phenomenon of policy churn. arXiv preprint arXiv:2206.00730, 2022.
  29. Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. PMLR, 2015a.
  30. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
  31. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  32. Data-efficient reinforcement learning with self-predictive representations. arXiv preprint arXiv:2007.05929, 2020.
  33. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28, 2015.
  34. Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  35. Introduction to reinforcement learning. 1998.
  36. Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis lectures on artificial intelligence and machine learning, 4(1):1–103, 2010.
  37. Neural discrete representation learning. In NIPS, 2017.
  38. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016a.
  39. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003. PMLR, 2016b.
  40. Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, 1989.
  41. MinAtar: An Atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176, 2019.