Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence (2301.11476v4)
Abstract: Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence penalty to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in its definition. The approach is a strict generalization: $q = 1$ corresponds to the standard KL divergence, while $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
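For concreteness, below is a minimal sketch of the quantities involved, using one common convention for the $q$-logarithm from the Tsallis statistics literature (e.g., Furuichi et al., 2004); the paper's exact parameterization may differ, but the defining property, that $q \to 1$ recovers the standard KL divergence, is shared:

$$
\ln_q x = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad \lim_{q \to 1} \ln_q x = \ln x,
$$

$$
D_q(\pi \,\|\, \mu) = -\sum_{a} \pi(a \mid s)\, \ln_q \frac{\mu(a \mid s)}{\pi(a \mid s)} = \frac{1 - \sum_{a} \pi(a \mid s)^{q}\, \mu(a \mid s)^{1-q}}{1 - q},
$$

which reduces to the standard KL divergence $\sum_a \pi(a \mid s) \ln \frac{\pi(a \mid s)}{\mu(a \mid s)}$ in the limit $q \to 1$.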
- M. G. Azar, V. Gómez, and H. J. Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(1):3207–3245, 2012.
- L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 968–974, 1999.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47(1):253–279, 2013.
- B. Belousov and J. Peters. Entropic regularization of Markov decision processes. Entropy, 21(7), 2019.
- M. Blondel, A. F. T. Martins, and V. Niculae. Learning with Fenchel-Young losses. Journal of Machine Learning Research, 21(35):1–69, 2020.
- G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Greedification operators for policy optimization: Investigating forward and reverse KL divergences. Journal of Machine Learning Research, 23(253):1–79, 2022.
- Effective exploration for deep reinforcement learning via bootstrapped Q-ensembles under Tsallis entropy regularization. arXiv preprint arXiv:1809.00403, 2018.
- Path consistency learning in Tsallis entropy regularized MDPs. In International Conference on Machine Learning, pages 979–988, 2018.
- L. Condat. Fast projection onto the simplex and the l1 ball. Mathematical Programming, 158:575–585, 2016.
- T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
- W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, pages 2892–2899, 2018.
- J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
- Fundamental properties of Tsallis relative entropy. Journal of Mathematical Physics, 45(12):4868–4877, 2004.
- Variational inference based on robust divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84, pages 813–822, 2018.
- M. Geist, B. Scherrer, and O. Pietquin. A theory of regularized Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 2160–2169, 2019.
- A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1–19, 2019.
- Soft Q-learning with mutual-information regularization. In International Conference on Learning Representations, pages 1–13, 2019.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, pages 1861–1870, 2018.
- J. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Grundlehren Text Editions. Springer Berlin Heidelberg, 2004.
- Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888, 2019.
- Geometric value iteration: Dynamic error-aware KL regularization for reinforcement learning. In Proceedings of the 13th Asian Conference on Machine Learning, volume 157, pages 918–931, 2021.
- Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pages 2995–3003, 2019.
- KL-entropy-regularized RL with a generative model is minimax optimal. arXiv preprint arXiv:2205.14211, 2022.
- Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Robotics and Automation Letters, 3:1466–1473, 2018.
- Generalized Tsallis entropy reinforcement learning and its application to soft mobile robots. In Robotics: Science and Systems XVI, pages 1–10, 2020.
- Y. Li and R. E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, volume 29, 2016.
- A. F. T. Martins and R. F. Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 33rd International Conference on Machine Learning, pages 1614–1623, 2016.
- O. Nachum and B. Dai. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.
- AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
- J. Naudts. Deformed exponentials and logarithms in generalized thermostatistics. Physica A: Statistical Mechanics and its Applications, 316:323–334, 2002.
- S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, volume 29, pages 1–9, 2016.
- Tsallis relative entropy and anomalous diffusion. Entropy, 14(4):701–716, 2012.
- A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
- I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62:5973–6006, 2016.
- H. Suyari and M. Tsukada. Law of error in Tsallis statistics. IEEE Transactions on Information Theory, 51(2):753–757, 2005.
- Advantages of q-logarithm representation over q-exponential representation from the sense of scale and shift on nonlinear systems. The European Physical Journal Special Topics, 229(5):773–785, 2020.
- C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52:479–487, 1988.
- C. Tsallis. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World. Springer New York, 2009. ISBN 9780387853581.
- N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, and M. Geist. Leverage the average: an analysis of KL regularization in reinforcement learning. In Advances in Neural Information Processing Systems 33, pages 1–12, 2020a.
- N. Vieillard, O. Pietquin, and M. Geist. Munchausen reinforcement learning. In Advances in Neural Information Processing Systems 33, pages 1–11, 2020b.
- f-divergence variational inference. In Advances in Neural Information Processing Systems, volume 33, pages 17370–17379, 2020.
- Variational inference with tail-adaptive f-divergence. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 5742–5752, 2018.
- T. Yamano. Some properties of q-logarithm and q-exponential functions in Tsallis statistics. Physica A: Statistical Mechanics and its Applications, 305(3):486–496, 2002.
- Training deep energy-based models with f-divergence minimization. In Proceedings of the 37th International Conference on Machine Learning, ICML’20, pages 1–11, 2020.
- GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020.