Increasing Entropy to Boost Policy Gradient Performance on Personalization Tasks (2310.05324v1)
Abstract: In this effort, we consider the impact of regularization on the diversity of actions taken by policies generated from reinforcement learning agents trained using a policy gradient. Policy gradient agents are prone to entropy collapse, which means certain actions are seldom, if ever, selected. We augment the optimization objective function for the policy with terms constructed from various $\varphi$-divergences and Maximum Mean Discrepancy, which encourage the current policy to follow state visitation and/or action choice distributions that differ from those of previously computed policies. We provide numerical experiments using MNIST, CIFAR10, and Spotify datasets. The results demonstrate the advantage of diversity-promoting policy regularization and show that its use with gradient-based approaches significantly improves performance on a variety of personalization tasks. Furthermore, numerical evidence is given to show that policy regularization increases performance without sacrificing accuracy.
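To make the abstract's idea concrete, below is a minimal sketch (not the paper's implementation) of a diversity-regularized policy-gradient loss: a REINFORCE-style term augmented with a KL divergence (one member of the $\varphi$-divergence family) and a Gaussian-kernel Maximum Mean Discrepancy term, both measured against a previously computed reference policy. All names (`PolicyNet`, `lambda_kl`, `lambda_mmd`, the toy data) are illustrative assumptions, and the specific divergences and weights used in the paper may differ.

```python
# Sketch of a diversity-regularized policy-gradient objective, assuming a discrete
# action space and a frozen "reference" policy from an earlier training run.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small softmax policy over a discrete action set (hypothetical architecture)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.net(states), dim=-1)  # action probabilities


def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with a Gaussian kernel between two batches of vectors."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


def regularized_pg_loss(policy, ref_policy, states, actions, rewards,
                        lambda_kl=0.01, lambda_mmd=0.1):
    """REINFORCE loss minus diversity bonuses w.r.t. a previously computed policy."""
    probs = policy(states)                              # (B, A) current policy
    with torch.no_grad():
        ref_probs = ref_policy(states)                  # (B, A) frozen reference policy
    log_pi = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)
    pg_loss = -(log_pi * rewards).mean()                # maximize expected reward
    # KL(pi_current || pi_reference): subtracting it rewards action distributions
    # that move away from the reference policy, discouraging entropy collapse.
    kl = (probs * (torch.log(probs + 1e-8) - torch.log(ref_probs + 1e-8))).sum(-1).mean()
    mmd = gaussian_mmd(probs, ref_probs)                # kernel-based discrepancy term
    return pg_loss - lambda_kl * kl - lambda_mmd * mmd


# Usage sketch with random data standing in for a personalization environment.
torch.manual_seed(0)
policy, ref_policy = PolicyNet(8, 4), PolicyNet(8, 4)
states = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
rewards = torch.randn(32)
loss = regularized_pg_loss(policy, ref_policy, states, actions, rewards)
loss.backward()
```

The sign convention here treats the divergence terms as bonuses (subtracted from the loss), so larger disagreement with the reference policy is rewarded; the regularization weights would need tuning per task.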