Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms (2009.09538v3)

Published 20 Sep 2020 in cs.LG, cs.AI, and stat.ML

Abstract: We study the challenging exploration incentive problem in both bandit and reinforcement learning, where rewards are scale-free and potentially unbounded, a setting motivated by real-world scenarios and distinct from existing work. Past work in reinforcement learning either assumes costly interactions with an environment or proposes algorithms that find potentially low-quality local maxima. Motivated by EXP-type methods that integrate multiple agents (experts) for exploration in bandits under the assumption of bounded rewards, we propose new algorithms, namely EXP4.P and EXP4-RL, for exploration in the unbounded-reward case, and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges because the regret can no longer be bounded by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish regret upper bounds for EXP4.P in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. This unbounded-reward result also applies to a revised version of EXP3.P in the multi-armed bandit setting. With EXP4-RL, we extend EXP4.P from bandits to reinforcement learning to incentivize exploration by multiple agents, including one high-performing agent, for both efficiency and excellence. The algorithm has been tested on hard-to-explore games and shows significant improvements in exploration over state-of-the-art methods.
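For orientation, below is a minimal sketch of the classical bounded-reward EXP4.P update (Beygelzimer et al., 2011) that the paper builds on: exponential weights over experts, uniform-exploration mixing, importance-weighted reward estimates, and a high-probability confidence bonus. This is not the paper's unbounded-reward variant; the function and parameter names (`exp4p`, `experts`, `draw_reward`, `delta`) are illustrative, and the learning-rate and bonus constants follow the standard bounded analysis rather than anything stated in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp4p(experts, draw_reward, T, K, delta=0.05):
    """Run EXP4.P for T rounds on a K-armed bandit with rewards in [0, 1].

    experts:     list of callables; experts[i](t) returns a length-K
                 probability vector (expert i's advice at round t).
    draw_reward: callable; draw_reward(t, arm) returns a reward in [0, 1].
    """
    N = len(experts)
    gamma = min(1.0, np.sqrt(K * np.log(N) / T))  # uniform-exploration rate
    w = np.ones(N)                                # exponential weights over experts
    total = 0.0
    for t in range(T):
        advice = np.array([e(t) for e in experts])  # N x K advice matrix
        q = (w / w.sum()) @ advice                  # weighted expert advice
        p = (1.0 - gamma) * q + gamma / K           # mix in uniform exploration
        arm = rng.choice(K, p=p)
        r = draw_reward(t, arm)
        total += r
        # Importance-weighted (unbiased) reward estimate for every arm.
        x_hat = np.zeros(K)
        x_hat[arm] = r / p[arm]
        y_hat = advice @ x_hat                      # each expert's estimated gain
        v_hat = (advice / p).sum(axis=1)            # variance (confidence) term
        bonus = np.sqrt(np.log(N / delta) / (K * T)) * v_hat
        w *= np.exp(gamma / (2.0 * K) * (y_hat + bonus))
        w /= w.max()                                # rescale for numerical stability
    return total

# Toy usage: two experts on a 2-armed Bernoulli bandit (arm 0 pays 0.7).
means = np.array([0.7, 0.3])
experts = [lambda t: np.array([0.5, 0.5]),          # uninformed uniform expert
           lambda t: np.array([0.9, 0.1])]          # expert favoring the good arm
total = exp4p(experts, lambda t, a: float(rng.random() < means[a]), T=5000, K=2)
print(f"cumulative reward over 5000 rounds: {total:.0f}")
```

In the toy run, the weights should concentrate on the better-aligned expert, so cumulative reward should approach the 0.7-arm baseline; this is the bounded-reward analogue of the abstract's observation that including one sufficiently competent expert can drive near-optimal behavior.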

Authors (2)
  1. Mengfan Xu (9 papers)
  2. Diego Klabjan (111 papers)
Citations (1)
