Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence (2301.11476v4)

Published 27 Jan 2023 in cs.LG and cs.AI

Abstract: Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in its definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporating KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
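For concreteness, the sketch below illustrates the $q$-logarithm and a Tsallis KL divergence between two discrete policies. It assumes the common convention $\ln_q(x) = (x^{1-q} - 1)/(1 - q)$ and the form $D^q_{KL}(\pi \| \mu) = \mathbb{E}_{\pi}[-\ln_q(\mu/\pi)]$, which reduces to the standard KL divergence as $q \to 1$; the paper's exact normalization may differ, and the function names and example values here are purely illustrative.

```python
import numpy as np

def log_q(x, q):
    """q-logarithm: ln_q(x) = (x^(1-q) - 1) / (1 - q), with ln_1(x) = ln(x) as the limit."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(pi, mu, q):
    """Tsallis KL divergence D_q(pi || mu) = E_pi[-ln_q(mu / pi)] for discrete
    distributions with full support; recovers the standard KL as q -> 1."""
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(pi * -log_q(mu / pi, q)))

# Illustrative policies over three actions (hypothetical values).
pi = np.array([0.7, 0.2, 0.1])
mu = np.array([0.4, 0.4, 0.2])
print(tsallis_kl(pi, mu, q=1.0))  # standard KL(pi || mu)
print(tsallis_kl(pi, mu, q=2.0))  # Tsallis KL with q = 2
```

Under this convention, $q = 1$ returns the usual KL, while $q = 2$ simplifies to $\sum_a \pi(a)^2/\mu(a) - 1$, which penalizes large ratios $\pi/\mu$ more aggressively than the logarithmic penalty of the standard KL.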
