
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (1712.01815v1)

Published 5 Dec 2017 in cs.AI and cs.LG

Abstract: The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

Citations (1,625)

Summary

  • The paper demonstrates that a generalized reinforcement learning algorithm, AlphaZero, can quickly surpass elite chess and shogi programs through self-play refinement.
  • It combines deep neural networks with Monte-Carlo Tree Search to guide move selection while evaluating far fewer positions than conventional engines.
  • Experimental results show that AlphaZero defeats world-champion engines like Stockfish and Elmo within hours, underscoring its computational efficiency and broad applicability.

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

This paper presents AlphaZero, a generalized reinforcement learning algorithm designed to achieve superhuman performance across multiple complex domains without any domain-specific knowledge except the game rules. The algorithm builds upon the success of AlphaGo Zero, which demonstrated remarkable performance in the game of Go, and extends this capability to other strategic games, namely chess and shogi.

Algorithm Overview

AlphaZero combines a deep neural network with Monte-Carlo Tree Search (MCTS) to efficiently explore the vast state spaces characteristic of chess, shogi, and Go. The neural network, with parameters $\theta$, takes a position $s$ and outputs move probabilities $\mathbf{p}$ and a scalar value estimate $v$. Through self-play, AlphaZero iteratively improves its policy $\mathbf{p}_\theta$ and value estimate $v_\theta$, which in turn guide MCTS toward more targeted and efficient searches.
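To make the interplay between the network and the search concrete, the following minimal Python sketch shows one MCTS simulation driven by a policy/value network. The `state.apply` and `net.evaluate` interfaces, and the `c_puct` constant, are illustrative assumptions rather than the paper's actual implementation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                                   # P(s, a) from the policy head
    visit_count: int = 0                           # N(s, a)
    value_sum: float = 0.0                         # sum of backed-up values
    children: dict = field(default_factory=dict)   # action -> Node

    def q(self) -> float:
        # Mean action value Q(s, a); zero for unvisited nodes.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """PUCT-style selection: maximise Q(s, a) + U(s, a) over the children."""
    total = sum(c.visit_count for c in node.children.values())
    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=score)

def run_simulation(root: Node, state, net):
    """One simulation: select down to a leaf, expand it with the network,
    then back the value estimate up the visited path."""
    node, path = root, [root]
    while node.children:                       # selection
        action, node = select_child(node)
        state = state.apply(action)            # assumed game-state interface
        path.append(node)
    priors, value = net.evaluate(state)        # assumed network interface: (dict, float)
    for action, p in priors.items():           # expansion
        node.children[action] = Node(prior=p)
    for visited in reversed(path):             # backup, alternating perspective
        visited.visit_count += 1
        visited.value_sum += value
        value = -value
```

Running many such simulations per move and then choosing the action in proportion to its visit count is what turns the raw network outputs into a much stronger search-based policy.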

Training and Evaluation

AlphaZero’s training paradigm is a regime of pure self-play reinforcement learning (a schematic loop is sketched after this list), and it reached several key milestones:

  • Chess: Outperformed the TCEC 2016 world-champion program Stockfish after four hours of training.
  • Shogi: Surpassed the 2017 CSA world-champion program Elmo in less than two hours.
  • Go: Demonstrated superiority over AlphaGo Lee after eight hours, achieving this with a fraction of the computational resources and time used in prior models.
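The milestones above all come out of a single, game-agnostic loop: generate games by self-play with the current network guiding MCTS, store (state, search policy, outcome) tuples, and train the network to predict the search policies and game outcomes. The sketch below is a minimal illustration of that loop; the `play_game` callable and the `net.train_step` method are assumed interfaces, not the paper's code.

```python
import random

def self_play_training(net, play_game, num_iterations=100, games_per_iter=50,
                       train_steps=1000, batch_size=256):
    """Schematic game-agnostic self-play reinforcement learning loop.

    `play_game(net)` is an assumed helper that plays one self-play game with
    MCTS and returns (state, mcts_policy, outcome) tuples; `net.train_step`
    is an assumed method wrapping the loss and optimiser step.
    """
    replay_buffer = []
    for _ in range(num_iterations):
        # 1. Generation: the current network plays both sides of every game.
        for _ in range(games_per_iter):
            replay_buffer.extend(play_game(net))
        # 2. Learning: fit the policy head to the MCTS visit distributions and
        #    the value head to the final game outcomes (cross-entropy + MSE).
        for _ in range(train_steps):
            batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
            net.train_step(batch)
    return net
```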

Experimental Results

Detailed results from evaluation matches highlight the efficacy of AlphaZero:

  • Chess: In a 100-game match at tournament time controls of one minute per move, AlphaZero defeated Stockfish convincingly, winning 28 games, drawing 72, and losing none.
  • Shogi: Won the large majority of its match against Elmo, losing only a handful of games.
  • Go: Won the majority of games against a previously published version of AlphaGo Zero, underscoring the generality of the approach.

The training-performance figure and the match-result tables in the paper give quantitative detail on AlphaZero's performance trajectory during training and the outcomes of the evaluation matches.

Computational Efficiency

AlphaZero’s efficiency is notable: its MCTS evaluates orders of magnitude fewer positions per second than traditional alpha-beta engines (roughly 80 thousand per second in chess versus roughly 70 million for Stockfish, per the paper), yet achieves superior playing strength. The search concentrates effort on the most promising lines of play, guided by the neural network’s policy and value estimates. This contrasts sharply with engines like Stockfish and Elmo, which search vastly more positions using handcrafted evaluation functions and domain-specific heuristics.
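The selectivity comes from the action choice inside the tree: at each step the search picks the move maximising a mean-value term plus a prior-weighted exploration bonus, a variant of PUCT. Schematically (the exact constants and form follow the paper's search rule, which this summary does not reproduce verbatim):

$$ a_t = \arg\max_a \left( Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \right) $$

Here $P(s,a)$ is the network's prior, $N(s,a)$ the visit count, and $Q(s,a)$ the mean backed-up value, so moves the network considers promising are explored deeply while weak lines are visited rarely.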

Implications and Future Directions

Practically, AlphaZero's ability to master multiple strategy games from scratch showcases the potential for generalized algorithms in varied domains. Theoretically, the results challenge traditional beliefs regarding the supremacy of alpha-beta search in strategic games, positing Monte-Carlo methods augmented with neural networks as a viable and often superior alternative.

Future developments in AI might expand upon this framework to tackle real-time decision-making tasks and complex simulations beyond board games. Further investigations could integrate domain-specific refinements or multi-domain learning capabilities to enhance AlphaZero’s adaptability and performance.

Conclusion

AlphaZero is a significant advancement in the application of reinforcement learning to complex strategy games. By eschewing domain-specific knowledge and employing a unified approach to learning, it transcends the limitations of traditional game-specific algorithms, pointing toward a new horizon in the development of general AI systems. The convergence of deep learning and MCTS augurs well for applications requiring strategic planning and real-time decision-making, holding promise for diverse and impactful AI-driven innovations.
