vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement (2405.08638v1)

Published 14 May 2024 in cs.LG

Abstract: Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations: policy evaluation and policy improvement. Enhancing learning efficiency remains a key challenge in RL, with many efforts focused on using ensemble critics to boost policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Optimizing the policy improvement process is therefore crucial for enhancing learning efficiency. This study investigates the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients used in the policy improvement process. By measuring this disagreement, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.
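
The abstract describes the mechanism only at a high level; the sketch below illustrates one plausible reading of it, assuming per-transition actor gradients are available from each ensemble critic. The helper names (vmf_concentration, resample_indices), the Banerjee-style approximation for the von Mises-Fisher concentration kappa, and the kappa-proportional sampling weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def vmf_concentration(grads):
    """Estimate a von Mises-Fisher concentration kappa for one transition.

    grads: (n_critics, d) array of actor gradients, one per ensemble critic.
    Uses the common approximation kappa ~= R(d - R^2) / (1 - R^2), where R is
    the mean resultant length of the normalized gradient directions.
    Larger kappa means the critics' gradient directions agree more
    (i.e., lower uncertainty of gradient directions).
    """
    d = grads.shape[1]
    units = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-8)
    r_bar = np.linalg.norm(units.mean(axis=0))   # mean resultant length in [0, 1]
    r_bar = min(r_bar, 1.0 - 1e-6)               # guard the approximation near R = 1
    return r_bar * (d - r_bar ** 2) / (1.0 - r_bar ** 2)

def resample_indices(per_transition_grads, batch_size, rng=None):
    """Resample transition indices, favoring low directional uncertainty.

    per_transition_grads: (batch, n_critics, d) actor gradients computed
    separately under each critic. Indices are drawn with probability
    proportional to the estimated concentration kappa (an assumed weighting).
    """
    rng = rng or np.random.default_rng()
    kappas = np.array([vmf_concentration(g) for g in per_transition_grads])
    probs = kappas / kappas.sum()
    return rng.choice(len(kappas), size=batch_size, replace=True, p=probs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy batch: 128 transitions, 5 critics, 6-dimensional action gradients.
    grads = rng.normal(size=(128, 5, 6))
    idx = resample_indices(grads, batch_size=64, rng=rng)
    print(idx[:10])
```

In the setting the abstract describes, such probabilities would drive the replay sampler so that transitions whose ensemble gradients point in consistent directions are drawn more often during policy improvement.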
