Super-Exponential Regret for UCT, AlphaGo and Variants (2405.04407v2)
Published 7 May 2024 in cs.LG and cs.AI
Abstract: We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have $\exp(\dots\exp(1)\dots)$ regret (with $\Omega(D)$ exp terms) on the $D$-chain environment, and that a "polynomial" UCT variant has $\exp_2(\exp_2(D - O(\log D)))$ regret on the same environment. The original proofs contain an oversight for rewards bounded in $[0, 1]$, which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero), showing that they too incur $\exp_2(\exp_2(D - O(\log D)))$ regret.
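To get a feel for the magnitudes in the abstract, the following is a minimal sketch, not taken from the paper, that evaluates the two growth rates numerically. The function names are hypothetical, and the constant `c` standing in for the hidden factor in $O(\log D)$ is an assumption chosen purely for illustration; the abstract does not state it.

```python
import math


def iterated_exp(k: int) -> float:
    """Apply exp to 1 a total of k times: exp(...exp(1)...).

    Returns math.inf once the value no longer fits in a float.
    """
    x = 1.0
    for _ in range(k):
        try:
            x = math.exp(x)
        except OverflowError:
            return math.inf
    return x


def double_exp2_log10(depth: int, c: float = 1.0) -> float:
    """Return log10 of 2 ** (2 ** (depth - c * log2(depth))).

    Working with the logarithm avoids overflow. The constant c is an
    assumed stand-in for the unspecified factor in O(log D).
    """
    inner = depth - c * math.log2(depth)
    return (2.0 ** inner) * math.log10(2.0)


if __name__ == "__main__":
    for k in range(1, 5):
        print(f"exp iterated {k} times: {iterated_exp(k)}")
    for depth in (8, 16, 32):
        digits = double_exp2_log10(depth)
        print(f"D={depth}: about 10^{digits:.0f} (assuming c=1)")
```

Under these assumptions, the iterated exponential already exceeds double-precision range after four applications, and the doubly exponential bound has over a thousand decimal digits at $D = 16$.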
- Bandit algorithms for tree search. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI’07, pages 67–74, Arlington, Virginia, USA, 2007. AUAI Press.
- Bandit based Monte-Carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Machine Learning: ECML 2006, pages 282–293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
- Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.
- Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
- Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017.