Mastermind-Go: LLM for Advanced Go Strategy

Updated 5 February 2026
  • Mastermind-Go is a specialized large language model agent fine-tuned to excel in the game of Go through algorithmic data synthesis and strategic self-play.
  • Its multi-task training pipeline integrates state-transition prediction, KataGo analysis, and expert commentary, achieving near-perfect simulation and enhanced decision-making.
  • The model demonstrates significant transfer improvements on structured reasoning tasks, highlighting the effectiveness of data-centric methods over traditional RL approaches.

Mastermind-Go is an LLM agent fine-tuned to master the game of Go through algorithmic data synthesis, with the goal of inducing strong decision-making and strategic reasoning capabilities. Developed by post-training an LLaMA-2-7B transformer on curated data from self-play, MCTS-based analysis, and expert commentary, Mastermind-Go achieves near-perfect simulation of Go’s environment dynamics and demonstrates transfer improvements on general reasoning tasks. The methodology for constructing and evaluating Mastermind-Go exemplifies a shift from RL-driven paradigms toward offline, multi-source, data-centric approaches to modeling complex decision games (Wang et al., 18 Mar 2025).

1. Data Synthesis Pipeline

The Mastermind-Go pipeline converts the 19×19 Go board and move histories into structured text suitable for LLM processing. Each intersection is encoded as “#” (Black), “o” (White), or “•” (empty), with the most recent Black and White stones further labeled using move indices (“#(k)”, “o(k)”). This state, linearized as coordinate-symbol pairs, forms the context s, and a move a is represented by its coordinate (e.g., “G7”) or by “pass”.
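Under the stated encoding, a board state can be serialized roughly as follows; the helper below is a minimal sketch with assumed details (column labels, the coordinate-symbol separator), not the paper’s exact serialization:

```python
BLACK, WHITE, EMPTY = "#", "o", "•"
COLS = "ABCDEFGHJKLMNOPQRST"  # standard Go column labels skip "I"

def encode_board(stones, last_moves, size=19):
    """Render a board as coordinate-symbol pairs.

    stones: dict (row, col) -> "#" or "o"
    last_moves: dict (row, col) -> move index k, tagging the most
                recent Black/White stones as "#(k)" / "o(k)".
    """
    lines = []
    for r in range(size):
        row = []
        for c in range(size):
            sym = stones.get((r, c), EMPTY)
            if (r, c) in last_moves:
                sym = f"{sym}({last_moves[(r, c)]})"
            # coordinate-symbol pair, e.g. "G7:#"
            row.append(f"{COLS[c]}{size - r}:{sym}")
        lines.append(" ".join(row))
    return "\n".join(lines)
```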

Three principal synthetic datasets and a combined task are constructed:

  • Task 1: State-Transition — 150,000 examples of the form (s, a → s′). Each sample asks the model to predict the exact next board state s′ after move a is played on s, including all captures.
  • Task 2: KataGo Analysis — From self-play games generated by KataGo, candidate moves {a_1, ..., a_k} are extracted for each state s. For each a_i, the dataset stores the ownership map O_i (color control of board points), score difference Δ_i (territory count), and win probability w_i. Two sampling regimes are included: “KataGo-9d” (highest-PUCT move) and “KataGo-suboptimal” (sampled via top-p from the policy net π(s)), yielding approximately 138,000 samples.
  • Task 3: Book Commentary — 1,500 position-explanation pairs from Lee Sedol’s self-commentaries, prompting for extended natural-language explanations of strategy and influence.
  • Task 4: Combined Decision — Mixed-task samples integrating rules, analysis, and commentary into end-to-end chains for joint learning.
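As an illustration of the textual Q&A framing shared by these tasks, a Task 1 sample might be assembled along these lines (the prompt wording and field names are hypothetical, not taken from the paper):

```python
def make_transition_sample(state_text, color, move, next_state_text):
    """Package one state-transition example as an input/target text pair."""
    prompt = (
        "Current board:\n"
        f"{state_text}\n"
        f"{color} plays {move}. Predict the exact next board state, "
        "including all captures."
    )
    return {"input": prompt, "target": next_state_text}
```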

The MCTS data extraction leverages a standard PUCT formulation analogous to AlphaZero/KataGo, with policy prior P(s,a), action value Q(s,a), and neural-net leaf value v:

Selection:       a* = argmax_a [ Q(s,a) + c_puct · P(s,a) · √(Σ_b N(s,b)) / (1 + N(s,a)) ]
Expansion/Eval:  (P(·), v) = neural_net(s_leaf)
Backup:          for each visited (s, a):
                   Q(s,a) ← (N(s,a) · Q(s,a) + v) / (N(s,a) + 1)
                   N(s,a) ← N(s,a) + 1
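The selection and backup rules can be written out in runnable form; the data layout and the c_puct default below are assumptions for illustration:

```python
import math

def puct_select(stats, prior, c_puct=1.5):
    """Pick the move maximizing Q + c_puct * P * sqrt(N_total) / (1 + N).

    stats: dict move -> (visit count N, mean value Q)
    prior: dict move -> policy prior P(s, a)
    """
    total_n = sum(n for n, _ in stats.values())

    def score(a):
        n, q = stats[a]
        return q + c_puct * prior[a] * math.sqrt(total_n) / (1 + n)

    return max(stats, key=score)

def backup(stats, move, v):
    """Incremental-mean update of (N, Q) for one visited edge."""
    n, q = stats[move]
    stats[move] = (n + 1, (n * q + v) / (n + 1))
```

Under-visited moves with a high prior receive an exploration bonus that shrinks as their visit counts grow, which is what drives the tree toward promising lines.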

2. Architecture and Learning Regimen

Mastermind-Go fine-tunes a base LLaMA-2-7B transformer without structural changes or new RL, policy, or value heads. Instead, four tasks are cast as textual Q&A pairs amenable to sequence-to-sequence optimization. At inference, a minimal external tool (“Count”) is employed to aggregate territory counts from predicted ownership maps—a lightweight functional interface without learned heads.
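The paper does not detail the “Count” tool’s interface; a minimal sketch of such a helper, under the assumption that it simply tallies a predicted ownership map, could be:

```python
def count_score(ownership):
    """Tally an ownership map into a territory estimate.

    ownership: dict point -> value in [-1, 1], positive meaning Black
    control, negative meaning White control (assumed convention).
    Returns (black_points, white_points, score_difference).
    """
    black = sum(1 for v in ownership.values() if v > 0)
    white = sum(1 for v in ownership.values() if v < 0)
    return black, white, black - white
```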

All training samples (X_i, Y_i) are optimized by a pure supervised learning objective:

\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(Y_i \mid X_i)

No actor-critic, RLHF, or other reinforcement objectives are utilized.
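The objective can be illustrated with toy per-token probabilities (this demonstrates the formula only, not the actual training code):

```python
import math

def sft_loss(target_token_probs):
    """Average negative log-likelihood over N samples.

    target_token_probs: list of per-sample lists, each holding the model
    probabilities p_theta(token | context) of the target tokens; their
    summed logs give log p_theta(Y_i | X_i).
    """
    n = len(target_token_probs)
    return -sum(
        sum(math.log(p) for p in probs) for probs in target_token_probs
    ) / n
```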

A curriculum strategy weights Task 1 (rules) more heavily in early stages, then shifts toward increased inclusion of Tasks 2/3/4 to cultivate evaluation and strategic language generation.
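One plausible way to realize such a curriculum (the weights and schedule below are illustrative assumptions, not values from the paper):

```python
def task_weights(progress):
    """Sampling weights for Tasks 1-4 as training progresses in [0, 1].

    Task 1 (rules) dominates early; its mass is gradually reassigned
    evenly to Tasks 2-4 (analysis, commentary, combined decision).
    """
    w1 = 0.7 - 0.5 * progress  # illustrative start/end weights
    rest = (1.0 - w1) / 3.0
    return [w1, rest, rest, rest]
```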

3. Training Hyperparameters and Dataset Scale

Mastermind-Go is trained using the following configuration (as specified in Appendix C.1):

  • Optimizer: AdamW, weight decay 0.01, β_1 = 0.99, β_2 = 0.999
  • Learning rate: 5 × 10⁻⁵, cosine decay, 3k warmup steps
  • Batch size: 8 sequences per GPU, 8 × A100 GPUs
  • LoRA rank r = 32, α = 64, dropout 0.1
  • Gradient clipping norm: 1.0
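The stated learning-rate schedule (peak 5 × 10⁻⁵, cosine decay after 3k warmup steps) can be sketched as follows; the total step count is an assumed value for illustration:

```python
import math

def lr_at(step, peak=5e-5, warmup=3000, total_steps=60000):
    """Linear warmup to `peak`, then cosine decay to zero."""
    if step < warmup:
        return peak * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(t, 1.0)))
```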

Dataset composition:

Task            # Samples   Tokens (T)
Go-next-state   150,000     0.23
Go-analysis     138,000     0.22
Go-state-expl   1,500       0.01

4. Evaluation and Performance Metrics

Proxy task evaluation (Table 4) benchmarks Mastermind-Go on next-state prediction, score/winrate estimation, and explanation generation:

Model               s′ Acc. ↑   Score MAE ↓   WinRate MAE ↓   expl RL ↑   expl ppl ↓
LLaMA-2-7B (base)   0.00%       N/A           N/A             0.28        11.45
Single-task SFT     99.44%      1.80          5.14%           0.44        5.23
Multi-task SFT      96.08%      1.74          4.49%           0.43        3.64
  • s′ Acc. = fraction of perfect next-state predictions
  • Score MAE = mean absolute error of the point differential Δ
  • WinRate MAE = mean absolute error (percentage points) of w
  • expl RL = Rouge-L score versus gold strategic explanations
  • expl ppl = perplexity of gold explanations under the fine-tuned model
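The exact-match and MAE metrics can be computed as below (function names are mine):

```python
def exact_match_acc(preds, golds):
    """Fraction of predictions that reproduce the gold string exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mae(preds, golds):
    """Mean absolute error between predicted and gold values."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)
```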

Multi-task SFT yields slightly lower next-state accuracy (96%) than the single-task variant (99%), but improves on language modeling metrics (final perplexity 3.64 vs 5.23). This balance indicates that concurrent training over rule simulation, value analysis, and commentary leads to broader skill generalization. No Elo or direct play-vs-bot ratings are reported for Mastermind-Go; only proxy tasks are evaluated (Wang et al., 18 Mar 2025).

5. Generalization to Unseen Reasoning Tasks

Transfer to the BIG-Bench Hard (BBH) suite is evaluated (Table 7), focusing on long-sequence and structured reasoning:

Task        LLaMA-2-7B   Mastermind-Go
TempSeq     12.00%       20.40%
PengTab     31.51%       29.45%
Snarks      47.75%       51.69%
RuinNames   32.80%       39.20%
Hyper.      51.60%       51.60%
Nav.        53.60%       60.00%

Long-sequence/structured tasks (TempSeq, Snarks, Nav) show gains of roughly 4–8 percentage points, while other BBH subtasks (e.g., PengTab) remain comparable or slightly lower. These improvements are attributed to the chain-of-thought-style sequences, logical state-transition exposure, and long-context modeling demanded by Go. The authors also document mild catastrophic forgetting on unrelated date/spatial reasoning skills (Wang et al., 18 Mar 2025), suggesting that the curriculum instills generalizable skills but that blending game data with a wider array of standard reasoning benchmarks may be needed to avoid unwanted interference.

6. Significance and Implications

Mastermind-Go demonstrates that an LLM can be endowed with high-fidelity environment simulation and nuanced strategic analysis in Go solely through algorithmic data synthesis, without policy/value head additions or on-policy RL. The approach achieves:

  • ~96–99% next-state prediction accuracy
  • Mean absolute errors of ~1.7 points (score) and ~4.5 percentage points (win probability)
  • Strategic explanation quality (Rouge-L ~0.43, perplexity ~3.6) comparable to expert commentary
  • Transfer enhancement on long-context, structured reasoning tasks

This paradigm validates the use of large-scale synthetic and commentary data, combined with modular curriculum mixing, as an alternative to reinforcement learning-centric pipelines in complex strategic domains. A plausible implication is that further integration of diverse, structure-rich reasoning data may yield LLMs with even broader transfer outside the domain of games.

7. Context Within Decision Game LLM Research

Mastermind-Go’s design, emphasizing off-policy, multi-task textual supervision, contrasts with prior RL-driven frameworks (e.g., DeepMind AlphaZero, policy/value RL fine-tuning for LLMs). By leveraging MCTS-based self-play, model-based evaluation signals, and professional commentaries in a unified textual form, the approach bypasses the need for custom model heads or reward-driven exploration. The explicit chain-of-thought and state evolution data are posited as key catalysts for improved reasoning generalization (Wang et al., 18 Mar 2025). This methodology offers a candidate blueprint for LLM adaptation to other decision games and high-structure reasoning domains, advancing the frontier of algorithmic knowledge synthesis in foundation models.
