S$^2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic (2405.00987v1)

Published 2 May 2024 in cs.LG

Abstract: Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity, and robustness. Notably, in Maximum Entropy Reinforcement Learning (MaxEnt RL), the policy is modeled as an expressive Energy-Based Model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBMs, which is an open problem. To address this, previous MaxEnt RL methods either implicitly estimate the entropy, resulting in high computational complexity and variance (SQL), or follow a variational inference procedure that fits simplified actor distributions (e.g., Gaussian) for tractability (SAC). We propose Stein Soft Actor-Critic (S$^2$AC), a MaxEnt RL algorithm that learns expressive policies without compromising efficiency. Specifically, S$^2$AC uses parameterized Stein Variational Gradient Descent (SVGD) as the underlying policy. We derive a closed-form expression for the entropy of such policies. Our formula is computationally efficient and depends only on first-order derivatives and vector products. Empirical results show that S$^2$AC yields better solutions to the MaxEnt objective than SQL and SAC in the multi-goal environment and outperforms SAC and SQL on the MuJoCo benchmark. Our code is available at: https://github.com/SafaMessaoud/S2AC-Energy-Based-RL-with-Stein-Soft-Actor-Critic
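For intuition about the sampler the abstract refers to, the sketch below shows a plain SVGD update that moves a set of action particles toward an energy-based target such as $\exp(Q(s,a)/\alpha)$. This is a generic, minimal illustration in NumPy, not the authors' implementation: the `grad_log_p`, `h`, and `step_size` names are placeholders, and the RBF kernel is just the standard default choice for SVGD.

```python
import numpy as np

def rbf_kernel(x, y, h=1.0):
    """RBF kernel k(x, y) and its gradient with respect to x."""
    diff = x - y
    k = np.exp(-np.dot(diff, diff) / (2.0 * h ** 2))
    grad_x = -(diff / h ** 2) * k
    return k, grad_x

def svgd_step(particles, grad_log_p, h=1.0, step_size=0.1):
    """One SVGD update: each particle a_i follows the kernelized Stein direction
    phi(a_i) = (1/n) * sum_j [ k(a_j, a_i) * grad_log_p(a_j) + grad_{a_j} k(a_j, a_i) ].
    The first term pulls particles toward high-density (e.g. high-Q) regions;
    the second term repels particles from one another, keeping the set diverse.
    """
    n, _ = particles.shape
    phi = np.zeros_like(particles)
    for i in range(n):
        for j in range(n):
            k, grad_k = rbf_kernel(particles[j], particles[i], h)
            phi[i] += k * grad_log_p(particles[j]) + grad_k
    return particles + step_size * phi / n

# Toy usage: drive particles toward a standard 2-D Gaussian target density.
rng = np.random.default_rng(0)
actions = rng.normal(size=(64, 2)) * 3.0
score = lambda a: -a  # grad log N(0, I); for a policy this would be grad_a Q(s, a) / alpha
for _ in range(200):
    actions = svgd_step(actions, score, h=1.0, step_size=0.1)
```

Per the abstract, S$^2$AC's contribution is that the entropy of the policy defined by such parameterized updates admits a closed-form expression involving only first-order derivatives and vector products, which is what makes the MaxEnt objective tractable with an expressive sampler; that derivation is given in the paper and is not reproduced in this sketch.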


