Suphx: Mastering Mahjong with Deep Reinforcement Learning (2003.13590v2)

Published 30 Mar 2020 in cs.AI

Abstract: AI has achieved great success in many domains, and game AI is widely regarded as its beachhead since the dawn of AI. In recent years, studies on game AI have gradually evolved from relatively simple environments (e.g., perfect-information games such as Go, chess, shogi or two-player imperfect-information games such as heads-up Texas hold'em) to more complex ones (e.g., multi-player imperfect-information games such as multi-player Texas hold'em and StarCraft II). Mahjong is a popular multi-player imperfect-information game worldwide but very challenging for AI research due to its complex playing/scoring rules and rich hidden information. We design an AI for Mahjong, named Suphx, based on deep reinforcement learning with some newly introduced techniques including global reward prediction, oracle guiding, and run-time policy adaptation. Suphx has demonstrated stronger performance than most top human players in terms of stable rank and is rated above 99.99% of all the officially ranked human players in the Tenhou platform. This is the first time that a computer program outperforms most top human players in Mahjong.

Authors (10)

Junjie Li (98 papers)
Sotetsu Koyamada (8 papers)
Qiwei Ye (16 papers)
Guoqing Liu (42 papers)
Chao Wang (555 papers)
Ruihan Yang (43 papers)
Li Zhao (150 papers)
Tao Qin (201 papers)
Tie-Yan Liu (242 papers)
Hsiao-Wuen Hon (3 papers)

Citations (119)

Summary

An Expert Review of "Suphx: Mastering Mahjong with Deep Reinforcement Learning"

The paper "Suphx: Mastering Mahjong with Deep Reinforcement Learning" presents an AI system for the multiplayer game of Mahjong built on deep reinforcement learning and several newly introduced techniques. Because Mahjong combines a complex scoring system, extensive hidden information, and intricate playing rules, the work is a notable advance in AI for imperfect-information games.

The authors confront three fundamental challenges posed by Mahjong: scoring that is decided over a whole multi-round game rather than a single round, a vast state space created by hidden tiles, and an irregular game tree that makes tree-search methods such as Monte Carlo Tree Search (MCTS) hard to apply. They address these with global reward prediction, oracle guiding, and parametric Monte Carlo policy adaptation, respectively.

Methodology

Global Reward Prediction: A crucial part of the research is the introduction of a global reward predictor, aiming to distribute game-level rewards to individual rounds by capturing round-specific contributions to overall success. This predictor uses a recurrent neural network architecture to provide a more granular reward signal than the game's direct scoring would allow.
Oracle Guiding: Another novel technique is oracle guiding, where a model begins by utilizing perfect information to build a strong policy before gradually relinquishing this 'oracle' advantage, transitioning to a standard imperfect-information policy. This assists in accelerating learning by providing a potent initial policy that is systematically adjusted to operate under realistic constraints.
Parametric Monte Carlo Policy Adaptation (pMCPA): In lieu of traditional MCTS, the authors propose pMCPA to adapt the policy at run time. Rather than searching a game tree, pMCPA fine-tunes the parameters of the offline-trained policy during play, so that decisions are specialized to the game states the agent is actually facing.
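
The gradual removal of the oracle advantage described under oracle guiding can be sketched as a dropout mask applied to the perfect-information features. This is a minimal illustration only: the linear decay schedule, the function names oracle_keep_prob and mask_oracle_features, and the toy feature vectors are assumptions made for this example, not details confirmed by the summary.

```python
import numpy as np


def oracle_keep_prob(step, total_steps):
    """Keep probability for oracle features, decayed linearly from 1 to 0.

    The linear schedule is an assumption of this sketch.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def mask_oracle_features(normal, oracle, keep_prob, rng):
    """Concatenate normal features with oracle features under a Bernoulli mask.

    Each oracle feature is kept with probability keep_prob and zeroed otherwise,
    so the policy sees less perfect information as keep_prob shrinks.
    """
    mask = rng.random(oracle.shape) < keep_prob
    return np.concatenate([normal, oracle * mask])


rng = np.random.default_rng(0)
normal = np.ones(4)   # hypothetical features visible to a normal player
oracle = np.ones(6)   # hypothetical perfect-information features

# Start of training: every oracle feature is visible.
start = mask_oracle_features(normal, oracle, oracle_keep_prob(0, 100), rng)
# End of training: all oracle features are masked out.
end = mask_oracle_features(normal, oracle, oracle_keep_prob(100, 100), rng)
```

With a keep probability of 1 the policy input contains all oracle features, and with a keep probability of 0 it reduces to the normal features padded with zeros, so the same network can be carried from the oracle setting to the imperfect-information setting without changing its input shape.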

Results and Discussion

The resulting agent, Suphx, reached a record rank of 10 dan on the Tenhou online platform, and its stable rank exceeded that of existing AI systems and of most top human players. According to the abstract, Suphx is rated above 99.99% of all officially ranked human players on Tenhou, which makes it a strong demonstration of what deep reinforcement learning can achieve in complex imperfect-information games.

The empirical results underline the utility of the proposed techniques. Global reward prediction proved essential in aligning learning signals with strategic objectives, while oracle guiding effectively leveraged additional information to refine strategy development. pMCPA showed the benefits of continuous adaptation in an environment characterized by significant uncertainty and hidden information.
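
To make the idea of aligning round-level learning signals with the game-level objective concrete, the sketch below turns a sequence of predicted final game rewards into per-round rewards by differencing consecutive predictions. It assumes that a trained predictor outputs, after each round, an estimate of the final game reward; the differencing rule, the function name round_rewards, and the numbers are illustrative assumptions rather than details stated in this summary.

```python
def round_rewards(predicted_final_rewards, baseline=0.0):
    """Convert per-round predictions of the final game reward into round rewards.

    predicted_final_rewards[k] is the (hypothetical) predictor output after
    round k. Each round is credited with the change it caused in the prediction.
    """
    rewards = []
    previous = baseline
    for prediction in predicted_final_rewards:
        rewards.append(prediction - previous)
        previous = prediction
    return rewards


# Hypothetical predictor outputs after each of four rounds.
predictions = [0.25, 0.5, 0.25, 1.0]
per_round = round_rewards(predictions)
```

Under this scheme the round rewards telescope, so their sum equals the last prediction minus the baseline. A round that lowers the predicted final reward receives a negative reward even if the agent gained points in that round, which is the kind of game-level credit assignment the summary describes.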

Implications

This research holds substantial implications for the domain of AI-driven game intelligence and beyond. By tackling the challenges inherent in Mahjong, Suphx highlights pathways to address similar difficulties in other domains requiring strategic decision-making with incomplete information. The methodologies here could be adapted to problems in finance, logistics, and other sectors requiring complex decision trees and strategic adaptation.

Future Directions

While Suphx marks a considerable step forward, the authors identify opportunities for improvement. These include adding features to the global reward predictor, refining oracle guiding through alternative approaches such as knowledge distillation, and extending pMCPA so that it keeps adapting as new game states become available. Integrating these improvements could raise performance further and inform strategies for other imperfect-information applications.

In conclusion, the paper contributes a competitive Mahjong AI and a set of techniques that are useful beyond Mahjong. It sets a precedent for applying deep reinforcement learning in strategic domains where information is inherently incomplete and rewards are multifaceted, and it provides a foundation for future research in this area.
