Mastermind-Go: LLM for Advanced Go Strategy

Updated 5 February 2026
  • Mastermind-Go is a specialized large language model agent fine-tuned to excel in the game of Go through algorithmic data synthesis and strategic self-play.
  • Its multi-task training pipeline integrates state-transition prediction, KataGo analysis, and expert commentary, achieving near-perfect simulation and enhanced decision-making.
  • The model demonstrates significant transfer improvements on structured reasoning tasks, highlighting the effectiveness of data-centric methods over traditional RL approaches.

Mastermind-Go is an LLM agent fine-tuned to master the game of Go through algorithmic data synthesis, with the goal of inducing strong decision-making and strategic reasoning capabilities. Developed by post-training an LLaMA-2-7B transformer on curated data from self-play, MCTS-based analysis, and expert commentary, Mastermind-Go achieves near-perfect simulation of Go’s environment dynamics and demonstrates transfer improvements on general reasoning tasks. The methodology for constructing and evaluating Mastermind-Go exemplifies a shift from RL-driven paradigms toward offline, multi-source, data-centric approaches to modeling complex decision games (Wang et al., 18 Mar 2025).

1. Data Synthesis Pipeline

The Mastermind-Go pipeline converts the 19×19 Go board and move histories into structured text suitable for LLM processing. Each intersection is encoded as “#” (Black), “o” (White), or “•” (empty), with the most recent Black and White stones further labeled using move indices (“#(k)”, “o(k)”). This state, linearized as coordinate-symbol pairs, forms the context s, and a move a is represented by its coordinate (e.g., “G7”) or by “pass”.
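Under the stated encoding, a board state can be serialized roughly as follows; the helper below is a minimal sketch with assumed details (column labels, the coordinate-symbol separator), not the paper’s exact serialization:

```python
BLACK, WHITE, EMPTY = "#", "o", "•"
COLS = "ABCDEFGHJKLMNOPQRST"  # standard Go column labels skip "I"

def encode_board(stones, last_moves, size=19):
    """Render a board as coordinate-symbol pairs.

    stones: dict (row, col) -> "#" or "o"
    last_moves: dict (row, col) -> move index k, tagging the most
                recent Black/White stones as "#(k)" / "o(k)".
    """
    lines = []
    for r in range(size):
        row = []
        for c in range(size):
            sym = stones.get((r, c), EMPTY)
            if (r, c) in last_moves:
                sym = f"{sym}({last_moves[(r, c)]})"
            # coordinate-symbol pair, e.g. "G7:#"
            row.append(f"{COLS[c]}{size - r}:{sym}")
        lines.append(" ".join(row))
    return "\n".join(lines)
```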

Three principal synthetic datasets and a combined task are constructed:

  • Task 1: State-Transition — 150,000 examples of the form (s, a → s′). Each sample asks the model to predict the exact next board state s′ after move a is played on s, including all captures.
  • Task 2: KataGo Analysis — From self-play games generated by KataGo, candidate moves {a_1, ..., a_k} are extracted for each state s. For each a_i, the dataset stores the ownership map O_i (color control of board points), score difference Δ_i (territory count), and win probability w_i. Two sampling regimes are included: “KataGo-9d” (highest-PUCT move) and “KataGo-suboptimal” (sampled via top-p from the policy net π(s)), yielding approximately 138,000 samples.
  • Task 3: Book Commentary — 1,500 position-explanation pairs from Lee Sedol’s self-commentaries, prompting for extended natural-language explanations of strategy and influence.
  • Task 4: Combined Decision — Mixed-task samples integrating rules, analysis, and commentary into end-to-end chains for joint learning.
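As an illustration of the textual Q&A framing shared by these tasks, a Task 1 sample might be assembled along these lines (the prompt wording and field names are hypothetical, not taken from the paper):

```python
def make_transition_sample(state_text, color, move, next_state_text):
    """Package one state-transition example as an input/target text pair."""
    prompt = (
        "Current board:\n"
        f"{state_text}\n"
        f"{color} plays {move}. Predict the exact next board state, "
        "including all captures."
    )
    return {"input": prompt, "target": next_state_text}
```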

The MCTS data extraction leverages a standard PUCT formulation analogous to AlphaZero/KataGo, with policy prior P(s,a), action value Q(s,a), and neural-net leaf value v:

Selection:       a* = argmax_a [ Q(s,a) + c_puct · P(s,a) · √(Σ_b N(s,b)) / (1 + N(s,a)) ]
Expansion/Eval:  (P(·), v) = neural_net(s_leaf)
Backup:          for each visited (s, a):
                   Q(s,a) ← (N(s,a) · Q(s,a) + v) / (N(s,a) + 1)
                   N(s,a) ← N(s,a) + 1
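The selection and backup rules can be written out in runnable form; the data layout and the c_puct default below are assumptions for illustration:

```python
import math

def puct_select(stats, prior, c_puct=1.5):
    """Pick the move maximizing Q + c_puct * P * sqrt(N_total) / (1 + N).

    stats: dict move -> (visit count N, mean value Q)
    prior: dict move -> policy prior P(s, a)
    """
    total_n = sum(n for n, _ in stats.values())

    def score(a):
        n, q = stats[a]
        return q + c_puct * prior[a] * math.sqrt(total_n) / (1 + n)

    return max(stats, key=score)

def backup(stats, move, v):
    """Incremental-mean update of (N, Q) for one visited edge."""
    n, q = stats[move]
    stats[move] = (n + 1, (n * q + v) / (n + 1))
```

Under-visited moves with a high prior receive an exploration bonus that shrinks as their visit counts grow, which is what drives the tree toward promising lines.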

2. Architecture and Learning Regimen

Mastermind-Go fine-tunes a base LLaMA-2-7B transformer without structural changes or new RL, policy, or value heads. Instead, four tasks are cast as textual Q&A pairs amenable to sequence-to-sequence optimization. At inference, a minimal external tool (“Count”) is employed to aggregate territory counts from predicted ownership maps—a lightweight functional interface without learned heads.
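The paper does not detail the “Count” tool’s interface; a minimal sketch of such a helper, under the assumption that it simply tallies a predicted ownership map, could be:

```python
def count_score(ownership):
    """Tally an ownership map into a territory estimate.

    ownership: dict point -> value in [-1, 1], positive meaning Black
    control, negative meaning White control (assumed convention).
    Returns (black_points, white_points, score_difference).
    """
    black = sum(1 for v in ownership.values() if v > 0)
    white = sum(1 for v in ownership.values() if v < 0)
    return black, white, black - white
```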

All training samples (X_i, Y_i) are optimized by a pure supervised learning objective:

\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(Y_i \mid X_i)

No actor-critic, RLHF, or other reinforcement objectives are utilized.
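The objective can be illustrated with toy per-token probabilities (this demonstrates the formula only, not the actual training code):

```python
import math

def sft_loss(target_token_probs):
    """Average negative log-likelihood over N samples.

    target_token_probs: list of per-sample lists, each holding the model
    probabilities p_theta(token | context) of the target tokens; their
    summed logs give log p_theta(Y_i | X_i).
    """
    n = len(target_token_probs)
    return -sum(
        sum(math.log(p) for p in probs) for probs in target_token_probs
    ) / n
```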

A curriculum strategy weights Task 1 (rules) more heavily in early stages, then shifts toward increased inclusion of Tasks 2/3/4 to cultivate evaluation and strategic language generation.
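One plausible way to realize such a curriculum (the weights and schedule below are illustrative assumptions, not values from the paper):

```python
def task_weights(progress):
    """Sampling weights for Tasks 1-4 as training progresses in [0, 1].

    Task 1 (rules) dominates early; its mass is gradually reassigned
    evenly to Tasks 2-4 (analysis, commentary, combined decision).
    """
    w1 = 0.7 - 0.5 * progress  # illustrative start/end weights
    rest = (1.0 - w1) / 3.0
    return [w1, rest, rest, rest]
```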

3. Training Hyperparameters and Dataset Scale

Mastermind-Go is trained using the following configuration (as specified in Appendix C.1):

  • Optimizer: AdamW, weight decay 0.01, β_1 = 0.99, β_2 = 0.999
  • Learning rate: 5 × 10⁻⁵, cosine decay, 3k warmup steps
  • Batch size: 8 sequences per GPU, 8 × A100 GPUs
  • LoRA rank r = 32, α = 64, dropout 0.1
  • Gradient clipping norm: 1.0
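The stated learning-rate schedule (peak 5 × 10⁻⁵, cosine decay after 3k warmup steps) can be sketched as follows; the total step count is an assumed value for illustration:

```python
import math

def lr_at(step, peak=5e-5, warmup=3000, total_steps=60000):
    """Linear warmup to `peak`, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(t, 1.0)))
```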

Dataset composition:

Task            # Samples   Tokens (T)
Go-next-state   150,000     0.23
Go-analysis     138,000     0.22
Go-state-expl   1,500       0.01

4. Evaluation and Performance Metrics

Proxy task evaluation (Table 4) benchmarks Mastermind-Go on next-state prediction, score/winrate estimation, and explanation generation:

Model               s′ Acc. ↑   Score MAE ↓   WinRate MAE ↓   expl RL ↑   expl ppl ↓
LLaMA-2-7B (base)   0.00%       N/A           N/A             0.28        11.45
Single-task SFT     99.44%      1.80          5.14%           0.44        5.23
Multi-task SFT      96.08%      1.74          4.49%           0.43        3.64
  • s′ Acc. = fraction of perfect next-state predictions
  • Score MAE = mean absolute error of the point differential Δ
  • WinRate MAE = mean absolute error (percentage points) of w
  • expl RL = Rouge-L score versus gold strategic explanations
  • expl ppl = perplexity of gold explanations under the fine-tuned model
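The exact-match and MAE metrics can be computed as below (function names are mine):

```python
def exact_match_acc(preds, golds):
    """Fraction of predictions that reproduce the gold string exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mae(preds, golds):
    """Mean absolute error between predicted and gold values."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)
```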

Multi-task SFT yields slightly lower next-state accuracy (96%) than the single-task variant (99%), but improves on language modeling metrics (final perplexity 3.64 vs 5.23). This balance indicates that concurrent training over rule simulation, value analysis, and commentary leads to broader skill generalization. No Elo or direct play-vs-bot ratings are reported for Mastermind-Go; only proxy tasks are evaluated (Wang et al., 18 Mar 2025).

5. Generalization to Unseen Reasoning Tasks

Transfer to the BIG-Bench Hard (BBH) suite is evaluated (Table 7), focusing on long-sequence and structured reasoning:

Task        LLaMA-2-7B   Mastermind-Go
TempSeq     12.00%       20.40%
PengTab     31.51%       29.45%
Snarks      47.75%       51.69%
RuinNames   32.80%       39.20%
Hyper.      51.60%       51.60%
Nav.        53.60%       60.00%

Long-sequence/structured tasks (TempSeq, Snarks, Nav) show gains of roughly 4–8 percentage points, while other BBH subtasks (e.g., PengTab) remain comparable or slightly lower. These improvements are attributed to the chain-of-thought-style sequences, logical state-transition exposure, and long-context modeling demanded by Go. The authors also document mild catastrophic forgetting on unrelated date/spatial reasoning skills (Wang et al., 18 Mar 2025), suggesting that the curriculum instills generalizable skills but that blending game data with a wider array of standard reasoning benchmarks may be needed to avoid unwanted interference.

6. Significance and Implications

Mastermind-Go demonstrates that an LLM can be endowed with high-fidelity environment simulation and nuanced strategic analysis in Go solely through algorithmic data synthesis, without policy/value head additions or on-policy RL. The approach achieves:

  • ~96–99% next-state prediction accuracy
  • Mean absolute errors of ~1.7 points (score) and ~4.5 percentage points (win probability)
  • Strategic explanation quality (Rouge-L ~0.43, perplexity ~3.6) comparable to expert commentary
  • Transfer enhancement on long-context, structured reasoning tasks

This paradigm validates the use of large-scale synthetic and commentary data, combined with modular curriculum mixing, as an alternative to reinforcement learning-centric pipelines in complex strategic domains. A plausible implication is that further integration of diverse, structure-rich reasoning data may yield LLMs with even broader transfer outside the domain of games.

7. Context Within Decision Game LLM Research

Mastermind-Go’s design, emphasizing off-policy, multi-task textual supervision, contrasts with prior RL-driven frameworks (e.g., DeepMind AlphaZero, policy/value RL fine-tuning for LLMs). By leveraging MCTS-based self-play, model-based evaluation signals, and professional commentaries in a unified textual form, the approach bypasses the need for custom model heads or reward-driven exploration. The explicit chain-of-thought and state evolution data are posited as key catalysts for improved reasoning generalization (Wang et al., 18 Mar 2025). This methodology offers a candidate blueprint for LLM adaptation to other decision games and high-structure reasoning domains, advancing the frontier of algorithmic knowledge synthesis in foundation models.
