- The paper introduces Expert Iteration (ExIt), which decouples planning via tree search and generalization via deep neural networks to improve sequential decision-making.
- It demonstrates strong performance in the board game Hex, outperforming policy-gradient baselines such as REINFORCE and achieving significant Elo rating gains over training.
- The dual-process framework combines fast, intuitive responses with deliberate planning, paving the way for advanced applications in robotics and autonomous systems.
Expert Iteration: Enhancing Sequential Decision Making through Deep Learning and Tree Search
The paper "Thinking Fast and Slow with Deep Learning and Tree Search" introduces Expert Iteration (ExIt), a reinforcement learning algorithm that addresses sequential decision-making by separating the planning and generalization responsibilities usually handled by a single deep reinforcement learning model. The approach pairs the strengths of tree search for planning with deep neural networks for policy generalization, yielding a method that can improve upon existing reinforcement learning approaches.
The core innovation of ExIt lies in decomposing the reinforcement learning process into two distinct yet interdependent tasks: planning through tree search and generalization via neural networks. Tree search explores potential future states, analogous to deliberate human lookahead reasoning, or "System 2" thinking. A deep neural network then generalizes the insights gained from these searches across the state space, producing fast, intuitive play akin to "System 1" thinking.
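As an illustration of how an apprentice policy can steer search, below is a minimal, hypothetical sketch of a UCT-style selection rule in which the apprentice's move probabilities act as a prior bonus. The paper's neural-guided MCTS uses a policy-bonus term of this general flavor, but the field and function names here are invented for illustration:

```python
import math

def select_action(node, c_puct=1.5):
    """Pick the child action maximizing an exploitation term (mean value)
    plus an exploration bonus weighted by the apprentice's prior.

    Assumes `node.children` maps actions to objects with fields:
      .visits (int), .value_sum (float), .prior (apprentice probability).
    This is an illustrative sketch, not the paper's exact formula.
    """
    total_visits = sum(ch.visits for ch in node.children.values())

    def score(child):
        # Mean value of simulations through this child (0 if unvisited)
        q = child.value_sum / child.visits if child.visits else 0.0
        # Exploration bonus: large for high-prior, rarely visited moves
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return q + u

    return max(node.children.items(), key=lambda kv: score(kv[1]))[0]
```

At an unvisited node the apprentice's prior dominates, so search effort is concentrated on moves the fast network already considers promising; as visit counts grow, observed values take over.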
One notable application and source of empirical validation for ExIt is the board game Hex, where it outperformed the policy-gradient algorithm REINFORCE. Through iterative self-play and expert improvement, ExIt produced a player that surpassed MoHex 1.0, the most recent Computer Olympiad champion whose source code is publicly available. This result underscores ExIt's potential for high-level strategy games and its effectiveness when trained tabula rasa, without prior knowledge of optimal strategies.
The paper details the implementation of ExIt, in which Imitation Learning (IL) is combined with reinforcement learning to continually refine the apprentice policy through expert improvement. The apprentice is a deep neural network, while the expert is a tree search procedure guided by that network. Each iteration uses the strengthened apprentice to bootstrap the tree search, so that both the quality and the speed of decision-making improve together, mirroring human learning cycles.
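The iteration described above can be sketched as a short loop. This is a hypothetical outline rather than the paper's implementation; `run_mcts`, `train_on`, and `self_play_states` are placeholder helpers standing in for the expert search, the imitation-learning step, and self-play data generation:

```python
def expert_iteration(apprentice, n_iterations, games_per_iter,
                     run_mcts, train_on, self_play_states):
    """Sketch of the ExIt loop under assumed helper interfaces:
      self_play_states(apprentice, n) -> iterable of game states
      run_mcts(state, apprentice)     -> expert-chosen move for that state
      train_on(apprentice, dataset)   -> apprentice fit to expert labels
    """
    for _ in range(n_iterations):
        dataset = []
        for state in self_play_states(apprentice, games_per_iter):
            # Expert improvement: tree search, guided by the current
            # apprentice, produces a stronger move choice than the
            # apprentice alone
            expert_move = run_mcts(state, apprentice)
            dataset.append((state, expert_move))
        # Imitation learning: the apprentice learns to reproduce the
        # expert's choices, and the improved apprentice strengthens
        # the next round of search
        apprentice = train_on(apprentice, dataset)
    return apprentice
```

The key design point is the feedback cycle: the slow expert generates training targets, and the fast apprentice, once trained, makes the next expert stronger.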
Numerical results from applying ExIt to Hex show marked gains in playing strength compared with standard approaches. The paper reports substantial Elo rating improvements over the course of training, illustrating the steady learning curve enabled by ExIt's dual-process strategy.
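To make Elo figures like these concrete, the standard Elo expected-score formula converts a rating gap into a predicted win probability:

```python
def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B.
    A gap of about 240 Elo points corresponds to roughly an 80% win rate."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
```

So a few hundred points of Elo improvement during training translates into a player that wins the large majority of games against its earlier self.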
The implications of ExIt extend beyond board games like Hex. Its structured decomposition of decision processes holds promise for other complex, structured domains such as robotic control and autonomous systems, where strategic planning combined with rapid decision-making is crucial.
Looking forward, continued refinement of ExIt could yield more capable autonomous systems that integrate fast intuitive responses with slower, deliberate planning. This duality of quick heuristic response coupled with rigorous analytical planning is likely to prove valuable across many AI domains.
In conclusion, the ExIt algorithm presents a significant advance in reinforcement learning by emulating the dual-process structure of human strategic decision-making. Its success in a complex domain like Hex suggests applicability to broader AI challenges and motivates further research to refine its components and explore its full potential in varied applications.