Monte Carlo Tree Search with Boltzmann Exploration (2404.07732v1)

Published 11 Apr 2024 in cs.AI and cs.LG

Abstract: Monte-Carlo Tree Search (MCTS) methods, such as Upper Confidence Bound applied to Trees (UCT), are instrumental to automated planning techniques. However, UCT can be slow to explore an optimal action when it initially appears inferior to other actions. Maximum ENtropy Tree-Search (MENTS) incorporates the maximum entropy principle into an MCTS approach, utilising Boltzmann policies to sample actions, naturally encouraging more exploration. In this paper, we highlight a major limitation of MENTS: optimal actions for the maximum entropy objective do not necessarily correspond to optimal actions for the original objective. We introduce two algorithms, Boltzmann Tree Search (BTS) and Decaying ENtropy Tree-Search (DENTS), that address these limitations and preserve the benefits of Boltzmann policies, such as allowing actions to be sampled faster by using the Alias method. Our empirical analysis shows that our algorithms achieve consistently strong performance across several benchmark domains, including the game of Go.
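
The abstract notes that Boltzmann policies allow actions to be sampled quickly via the Alias method. Purely as an illustration of that mechanism (not the paper's implementation; the function names, temperature value, and action-value estimates below are hypothetical), here is a minimal Python sketch that builds an alias table over a softmax (Boltzmann) distribution of action-value estimates and then samples actions in constant time per draw:

```python
import numpy as np

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax (Boltzmann) probabilities over estimated action values."""
    z = np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def build_alias_table(probs):
    """Vose's alias method: O(n) setup enabling O(1) sampling per draw."""
    n = len(probs)
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=int)
    scaled = np.asarray(probs, dtype=float) * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]  # move leftover mass onto the large entry
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:           # remaining entries have (numerically) full mass
        prob[i] = 1.0
    return prob, alias

def sample_action(prob, alias, rng=None):
    """Draw one action index in O(1): pick a column, then flip a biased coin."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(prob))
    return int(i) if rng.random() < prob[i] else int(alias[i])

# Usage: sample an action with probability proportional to exp(Q / temperature).
q_estimates = [1.2, 0.4, 2.0, 0.1]    # hypothetical action-value estimates
probs = boltzmann_probs(q_estimates, temperature=0.5)
prob, alias = build_alias_table(probs)
action = sample_action(prob, alias)
```

After the one-time O(n) table construction, each draw costs only one uniform index and one biased coin flip, which is what makes Boltzmann action sampling cheap inside a search loop.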
