
Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning (2403.07704v1)

Published 12 Mar 2024 in cs.LG and cs.AI

Abstract: In deep reinforcement learning, estimating the value function to evaluate the quality of states and actions is essential. The value function is often trained using the least squares method, which implicitly assumes a Gaussian error distribution. However, a recent study suggested that the error distribution for training the value function is often skewed because of the properties of the Bellman operator, and violates the implicit assumption of normal error distribution in the least squares method. To address this, we propose a method called Symmetric Q-learning, in which synthetic noise generated from a zero-mean distribution is added to the target values to generate a Gaussian error distribution. We evaluated the proposed method on continuous control benchmark tasks in MuJoCo. It improved the sample efficiency of a state-of-the-art reinforcement learning method by reducing the skewness of the error distribution.
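The core idea, adding zero-mean synthetic noise to the Bellman targets so that the resulting error distribution becomes approximately Gaussian, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the function name, the choice of Gaussian noise, and the noise scale are assumptions for illustration (the paper fits a suitable zero-mean distribution to the observed errors rather than fixing one).

```python
# Minimal sketch, assuming a standard TD target and a fixed zero-mean Gaussian
# noise model. All names (make_symmetric_target, noise_scale) are hypothetical.
import numpy as np

def make_symmetric_target(rewards, next_q_values, dones, gamma=0.99,
                          noise_scale=0.1, rng=None):
    """Compute TD targets and add zero-mean synthetic noise.

    The added noise is meant to counteract the skew that the Bellman operator
    induces in the value-fitting errors, so the least-squares assumption of a
    Gaussian error distribution is better satisfied.
    """
    rng = np.random.default_rng() if rng is None else rng
    td_target = rewards + gamma * (1.0 - dones) * next_q_values
    noise = rng.normal(loc=0.0, scale=noise_scale, size=td_target.shape)
    return td_target + noise

# Toy usage with a small batch of transitions
rewards = np.array([1.0, 0.5, 0.0, 2.0])
next_q = np.array([10.0, 8.0, 3.0, 7.5])
dones = np.array([0.0, 0.0, 1.0, 0.0])
print(make_symmetric_target(rewards, next_q, dones))
```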

