Mastermind-Dou: Multi-Domain Strategies
- Mastermind-Dou is a multifaceted construct that integrates adversarial jailbreak frameworks, LLM-based game decision making for Doudizhu, and linear-query algorithms for efficient Mastermind solving.
- In its adversarial jailbreak application, it employs multi-turn, hierarchical planning and feedback loops to achieve high attack success rates against leading LLM defenses.
- As a game agent and combinatorial solver, Mastermind-Dou leverages expert trajectory synthesis and binary-tree token sliding to reach up to 90% action accuracy and O(n) query efficiency.
Mastermind-Dou encompasses a set of technically distinct but nomenclaturally related constructs at the intersection of combinatorial search, adversarial language modeling, and game-theoretic deep learning. The term “Mastermind-Dou” appears in three principal domains: (1) as the codename for an LLM-based Doudizhu card game agent, (2) as a designation for a sharply optimal algorithm in black-peg Mastermind where the alphabet and code length coincide, and (3) as an instantiation of a self-improving, multi-turn jailbreak framework for LLMs. These usages exhibit no historical linkage but share a methodological emphasis on planning in adversarial or imperfect information environments.
1. Mastermind-Dou in Adversarial Jailbreaking of LLMs
Mastermind-Dou serves as an advanced, knowledge-driven multi-turn jailbreak agent, engineered for maximally effective evasion of state-of-the-art LLM defenses and the controlled induction of harmful outputs (Li et al., 9 Jan 2026). The framework operationalizes adversarial red teaming as a multi-turn, closed-loop Markovian process over conversation histories.
Formal Structure
- State Definition: At each turn, the state comprises the conversation history (the sequence of user–assistant pairs so far), the harmful seed query, and the target objective.
- Planning: A Planner outputs a multi-step plan in which each step represents a high-level adversarial sub-goal (e.g., persona adoption, masking intent).
- Execution: An Executor generates the prompt for the current turn.
- Control and Success Evaluation: A Controller determines whether each response advances the plan, refining or aborting it as necessary. Success is declared when the judge score surpasses a threshold.
Hierarchical Planning and Knowledge Integration
- Hierarchical Milestones: High-level objectives and low-level tactics jointly optimize the plan via a loss that balances an objective-alignment term against a coherence term.
- Repository Formalism: Mastermind-Dou continually refines a knowledge repository of reusable adversarial patterns, updated via feedback-driven extraction and pruning.
Closed-Loop Adaptation
Reflection remediates failed plans by optimizing for minimal redo errors and preservation of successful priors. Dynamic recombination uses a binary encoding of tactics and evolutionary operators (crossover, mutation, selection proportional to vulnerability oracle feedback) to efficiently navigate tactic combinatorics.
Empirical Impact
On HarmBench and StrongReject, Mastermind-Dou achieved attack success rates (ASR) of 67% (Claude 3.7 Sonnet) and 60% (GPT-5), outperforming X-Teaming and maintaining robustness even under advanced LLM defenses. Harmfulness ratings (HR) were also highest among tested baselines (Li et al., 9 Jan 2026).
2. Mastermind-Dou as the LLM-Based Doudizhu Agent
In LLM-empowered decision-making, Mastermind-Dou is a specialized agent for the 3-player imperfect-information card game Doudizhu. It combines algorithmic data synthesis with multi-head LLM finetuning to match or surpass state-of-the-art RL and rule-based agents (Wang et al., 18 Mar 2025).
Data Synthesis Pipeline
- Expert Trajectory Generation: Synthetic state-action trajectories are generated with three expert agents: RLCard’s rule-based policy, a supervised human-data mimic, and DouZero (a Deep Monte-Carlo RL expert that learns action values).
- Top-$p$ Filtering: At each state, actions are scored via DouZero’s Q-value network and filtered to the minimal set covering cumulative probability $p$, restricting the action prediction space.
- Imperfect-Information Modeling: For each candidate move, the downstream responses of the next two agents are recorded, introducing an opponent-strategy prediction head trained with cross-entropy.
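The filtering step above can be sketched as a nucleus-style selection. This is a minimal illustration, not the authors' code: the function name `top_p_filter`, the action labels, and the use of a softmax to turn scores into probabilities are all assumptions made for the example.

```python
import math


def top_p_filter(scores, p=0.9):
    """Return the smallest list of actions whose softmax mass reaches p.

    `scores` maps an action label to a scalar score (a stand-in for an
    expert network's output). Softmax normalization is an assumption.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Subtract the peak score before exponentiating for numerical stability.
    peak = max(scores.values())
    weights = {a: math.exp(s - peak) for a, s in scores.items()}
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append(action)
        mass += weight / total
        if mass >= p:  # stop as soon as the cumulative mass covers p
            break
    return kept


# Hypothetical scores for three legal moves in some state.
example_scores = {"pass": 0.0, "pair_3": 2.0, "single_K": 1.0}
filtered = top_p_filter(example_scores, p=0.9)
```

A smaller `p` keeps fewer candidate actions, which shortens the list the model has to rank at the cost of occasionally discarding a reasonable move.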
Model and Training
- Base: LLaMA-2-7B, fine-tuned with LoRA (rank 32) on 8×A100 GPUs.
- Input Encoding: Cards are encoded as integers, with the Jokers taking the highest values; action lists are sorted integer-encoded vectors.
- Heads: (1) Possible Action Prediction (ranking the next action, token by token), (2) Opponent Strategy Prediction (a linear layer that outputs a probability).
- Loss: a combined objective over the two heads, with cross-entropy for the opponent-strategy head.
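To make the integer encoding concrete, here is a small sketch. The specific rank-to-integer mapping (3 through 17, with the two Jokers last) and the names `RANKS`, `CARD_TO_INT`, `encode_cards`, and `encode_actions` are hypothetical; the source only states that cards are integers and that action lists are sorted integer-encoded vectors.

```python
# Hypothetical rank order, lowest to highest; "BJ"/"RJ" are the two Jokers.
RANKS = ["3", "4", "5", "6", "7", "8", "9", "10",
         "J", "Q", "K", "A", "2", "BJ", "RJ"]

# Assumed mapping: "3" -> 3, ..., "2" -> 15, "BJ" -> 16, "RJ" -> 17.
CARD_TO_INT = {rank: index + 3 for index, rank in enumerate(RANKS)}


def encode_cards(cards):
    """Encode a hand or a single move as a sorted list of integers."""
    return sorted(CARD_TO_INT[card] for card in cards)


def encode_actions(actions):
    """Encode a list of candidate moves as a sorted list of int vectors."""
    return sorted(encode_cards(action) for action in actions)


hand = encode_cards(["K", "3", "3", "RJ"])
moves = encode_actions([["A", "A"], ["3"], []])  # [] stands for "pass"
```

Sorting both within and across moves gives each game state a single canonical text form, so the same position is never presented to the model in two different orders.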
Empirical Results
Mastermind-Dou with probability-chain outperforms strong baselines and non-expert LLMs by a wide margin. Action accuracy reaches 90%, and landlord win rates against RLCard and DouZero are 90% and 41% respectively, close to DouZero’s own 90% and 43% (Wang et al., 18 Mar 2025).
Table: Mastermind-Dou Key Results (excerpt of Table 2 in Wang et al., 18 Mar 2025)
| Model | Win Rate vs. RLCard | Win Rate vs. DouZero |
|---|---|---|
| Mastermind-Dou with prob | 90% | 41% |
| DouZero (expert) | 90% | 43% |
| LLaMA-2-7B (few-shot+sim) | 12% | 3% |
Additionally, post-training on Doudizhu data yielded improved performance on BIG-Bench Hard reasoning tasks, though some catastrophic forgetting appeared on spatial/date subdomains.
3. Mastermind-Dou and Query Complexity in Black-Peg Mastermind
In combinatorial search, “Mastermind-Dou” (as an Editor's term) refers to the solution of Mastermind with $k = n$ using $O(n)$ black-peg queries, resolving an open efficiency gap (Martinsson et al., 2020).
Problem Statement
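In $k$-color, $n$-position black-peg Mastermind, the codemaker picks a secret code $x \in [k]^n$; the codebreaker submits queries $q \in [k]^n$, each answered with the number of positions $i$ at which $q_i = x_i$.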
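The black-peg answer is simply a count of exact positional matches. A minimal sketch, with `black_peg_feedback` as an illustrative name and colors represented as integers:

```python
def black_peg_feedback(secret, query):
    """Count positions where the query agrees with the secret code."""
    if len(secret) != len(query):
        raise ValueError("secret and query must have the same length")
    return sum(s == q for s, q in zip(secret, query))


# Positions 0 and 2 agree, so the answer is 2.
answer = black_peg_feedback([0, 1, 2, 3], [0, 2, 2, 1])
```

Note that the answer reveals how many positions match but not which ones, which is what makes the query-complexity question non-trivial.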
Main Result
For $k = n$, there exists a randomized algorithm recovering the secret code in $O(n)$ queries, which is tight by the entropy lower bound: each query leaks at most $O(\log n)$ bits and the total information is $n \log_2 n$ bits, yielding an $\Omega(n)$ lower bound.
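The counting argument behind the lower bound can be written out as follows. This is a sketch under the assumptions that $k = n$, that every one of the $n^n$ codes is possible, and that each answer takes one of the $n + 1$ values $0, \dots, n$; the symbol $T$ for the number of queries is introduced here for illustration.

```latex
% T queries distinguish at most (n+1)^T secrets, so
(n+1)^{T} \;\ge\; n^{n}
\quad\Longrightarrow\quad
T \;\ge\; \frac{n \log_2 n}{\log_2 (n+1)} \;=\; \Omega(n).

% Worked instance, n = 8:
T \;\ge\; \frac{8 \cdot 3}{\log_2 9} \;\approx\; \frac{24}{3.17} \;\approx\; 7.57,
\qquad\text{so } T \ge 8 .
```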
Algorithmic Outline
The key innovation is reducing to a “signed-permutation” Mastermind, where the secret is a permutation and queries can freely set positive/negative markers. The core algorithm uses an “information-tree token-sliding” method:
- Binary-tree token sliding: Encode the $n$ code positions as leaves of a complete binary tree. For each color, use a “token” propagated from the root to leaves, with queries partitioning at each node to localize the exact position.
- Query Compression: A two-phase approach (preprocessing, then solve) partitions and compresses the search; at each recursive step, three independent queries are collapsed into two via a Cantor–Mills–style linear combination, ensuring $O(n)$ total complexity.
- Key Lemmas: Existential results for “zero” and “distinct-one” queries (for blanking and uniquely identifying colors), as well as a query-combining lemma allowing parallel disjoint query resolution.
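The idea of sliding a token down a binary tree over positions can be illustrated with a deliberately simplified version. The sketch below locates each color by plain binary search, one color at a time, so it uses roughly $n \log_2 n$ queries rather than $O(n)$; it omits the query compression and combining lemmas that the outline credits for the linear bound. It also assumes the secret is a permutation and that a `BLANK` entry (one that matches nothing) is available, as a "zero" query would provide. All names are hypothetical.

```python
BLANK = None  # hypothetical blank entry that never matches any position


def feedback(secret, query):
    """Black-peg answer, treating BLANK entries as guaranteed misses."""
    return sum(q is not BLANK and q == s for s, q in zip(secret, query))


def locate_color(secret, color, counter):
    """Slide one token from the root interval down to a single position."""
    lo, hi = 0, len(secret)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Place the color on the left half of the interval, blanks elsewhere.
        query = [color if lo <= i < mid else BLANK for i in range(len(secret))]
        counter[0] += 1
        if feedback(secret, query) == 1:
            hi = mid  # token moves to the left child
        else:
            lo = mid  # token moves to the right child
    return lo


def solve_permutation(secret):
    """Recover a permutation secret; returns (guess, number_of_queries)."""
    counter = [0]
    guess = [BLANK] * len(secret)
    for color in range(len(secret)):
        guess[locate_color(secret, color, counter)] = color
    return guess, counter[0]


recovered, used = solve_permutation([2, 0, 3, 1])
```

Each query here carries information about only one color. The linear-query result depends on packing many such tests into shared queries, which this sketch does not attempt.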
Generalization
Extending to arbitrary $k$ and $n$, the randomized query complexity is:
- $\Theta(n \log k / \log n + k/n)$ (black-white peg)
- $\Theta(n \log k / \log n + k)$ (black-only)
These results synthesize previous bounds [Chvátal 1983, Doerr et al. 2016].
4. Cross-Domain Methodological Parallels
While the three usages of Mastermind-Dou target unrelated problems, common patterns can be abstracted:
- Hierarchical/Recursive Planning: All apply multi-level planning or recursive task decomposition: binary-tree token sliding, high-level/low-level adversarial planning, or multi-stage Doudizhu move selection.
- Combining Information Efficiently: Each exploits the informational content of every action (query, prompt, or move) and adaptively focuses resources via probability mass, reflection, or tree-partitioning.
- Closed-Loop Feedback: Each system (query complexity, LLM game reasoning, adversarial jailbreaks) incorporates feedback into iterative refinement, whether through information-theoretic bounds, loss surfaces, or explicit success/failure scoring.
5. Impact and Benchmarking
Mastermind-Dou establishes new benchmarks across all three domains:
- Combinatorial Search: First linear-query complexity for $k = n$ Mastermind, closing a decades-old open gap and yielding tight bounds for arbitrary parameter regimes (Martinsson et al., 2020).
- LLM Game Competency: Matching RL experts in Doudizhu action accuracy and win-rate, validating algorithmic data synthesis as a paradigm for LLM deployment in imperfect-information games (Wang et al., 18 Mar 2025).
- Jailbreak Adversariality: State-of-the-art attack effectiveness on LLMs under advanced defenses, generalizing across open and closed-source targets and outperforming strong baselines (Li et al., 9 Jan 2026).
6. Technical Case Studies and Pseudocode
Doudizhu LLM Pipeline Skeleton (Wang et al., 18 Mar 2025): [pseudocode listing missing from the source]
Mastermind-Dou Planning Loop (Li et al., 9 Jan 2026): [pseudocode listing missing from the source]
7. References
- (Martinsson et al., 2020) "Mastermind with a Linear Number of Queries" (Martinsson & Su), query complexity for $k = n$ Mastermind.
- (Wang et al., 18 Mar 2025) "Empowering LLMs in Decision Games through Algorithmic Data Synthesis," details the Doudizhu LLM agent architecture and performance.
- (Li et al., 9 Jan 2026) "Knowledge-Driven Multi-Turn Jailbreaking on LLMs," describes Mastermind-Dou for adversarial LLM exploitation.
A plausible implication is that the Mastermind-Dou naming convention will persist as a marker for technically sophisticated, feedback-driven, and adversarially optimized agents in combinatorial, game-theoretic, and red-teaming domains.