Mastermind-Dou: Multi-Domain Strategies
- Mastermind-Dou is a multifaceted construct that integrates adversarial jailbreak frameworks, LLM-based game decision making for Doudizhu, and linear-query algorithms for efficient Mastermind solving.
- In its adversarial jailbreak application, it employs multi-turn, hierarchical planning and feedback loops to achieve high attack success rates against leading LLM defenses.
- As a game agent and combinatorial solver, Mastermind-Dou leverages expert trajectory synthesis (reaching up to 90% action accuracy in Doudizhu) and binary-tree token sliding (achieving O(n) query efficiency in Mastermind).
Mastermind-Dou encompasses a set of technically distinct constructs, related only in name, at the intersection of combinatorial search, adversarial language modeling, and game-theoretic deep learning. The term “Mastermind-Dou” appears in three principal domains: (1) as the codename for an LLM-based Doudizhu card game agent, (2) as a designation for a sharply optimal algorithm in black-peg Mastermind where the alphabet size and code length coincide, and (3) as an instantiation of a self-improving, multi-turn jailbreak framework for LLMs. These usages exhibit no historical linkage but share a methodological emphasis on planning in adversarial or imperfect-information environments.
1. Mastermind-Dou in Adversarial Jailbreaking of LLMs
Mastermind-Dou serves as a knowledge-driven multi-turn jailbreak agent, engineered to evade state-of-the-art LLM defenses and elicit harmful outputs in controlled red-teaming evaluations (Li et al., 9 Jan 2026). The framework operationalizes adversarial red teaming as a multi-turn, closed-loop Markovian process over conversation histories.
Formal Structure
- State Definition: At turn $t$, the state is $s_t = (H_t, q, g)$, with $H_t$ denoting the sequence of user–assistant pairs so far, $q$ the harmful seed query, and $g$ the target objective.
- Planning: A Planner outputs a multi-step plan $P = (\eta_1, \dots, \eta_m)$, each $\eta_i$ representing a high-level adversarial sub-goal (e.g., persona adoption, masking intent).
- Execution: An Executor generates the current prompt $x_t$, conditioned on $H_t$ and the active sub-goal.
- Control and Success Evaluation: A Controller determines whether response $y_t$ advances $g$, refining or aborting the plan as necessary. Success is declared when a judge score surpasses a threshold $\tau$.
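This closed loop can be sketched abstractly as follows. Every name below (planner, executor, controller, judge, target_model) is a placeholder for the corresponding component described above, not the paper's actual API; the sketch captures only the plan–execute–evaluate–refine structure.

```python
def multi_turn_episode(q, goal, planner, executor, controller, judge,
                       target_model, tau=0.7, max_turns=10):
    """Abstract closed-loop skeleton: plan, execute, evaluate, refine.

    All components are injected as callables; tau is an illustrative
    success threshold, not a value from the paper.
    """
    history = []                    # sequence of (prompt, response) pairs
    plan = planner(q, goal)         # multi-step plan of sub-goals
    for t in range(max_turns):
        prompt = executor(history, plan, t)       # next turn's prompt
        response = target_model(prompt)           # query the target
        history.append((prompt, response))
        score = judge(response, goal)             # scalar success score
        if score >= tau:                          # success declared
            return history, score
        plan = controller(plan, history, score)   # refine the plan
    return history, None                          # episode aborted
```

The loop is deliberately agnostic to what the components do internally; it only encodes the Markovian state update over conversation histories described above.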
Hierarchical Planning and Knowledge Integration
- Hierarchical Milestones: High-level objectives and low-level tactics are jointly optimized via a loss that balances objective alignment and conversational coherence.
- Repository Formalism: Mastermind-Dou continually refines a knowledge repository of reusable adversarial patterns, updated via feedback-driven extraction and pruning.
Closed-Loop Adaptation
Reflection remediates failed plans by optimizing for minimal redo errors and preservation of successful priors. Dynamic recombination uses a binary encoding of tactics and evolutionary operators (crossover, mutation, selection proportional to vulnerability oracle feedback) to efficiently navigate tactic combinatorics.
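The recombination step admits a generic sketch as a genetic algorithm over binary tactic encodings. The fitness callable below stands in for the vulnerability oracle's feedback, and all hyperparameters (population size, mutation rate) are illustrative assumptions:

```python
import random

def evolve_tactics(fitness, n_bits=8, pop_size=20, generations=30, seed=0):
    """Generic GA over bitstrings: fitness-proportional selection,
    one-point crossover, bit-flip mutation. `fitness` is a stand-in
    for oracle feedback; parameters are illustrative."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]
        total = sum(scores) or 1.0   # guard against an all-zero population

        def pick():                  # roulette-wheel selection
            r, acc = rng.random() * total, 0.0
            for ind, sc in zip(pop, scores):
                acc += sc
                if acc >= r:
                    return ind
            return pop[-1]

        nxt = []
        for _ in range(pop_size):
            a, b = pick(), pick()
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                # bit-flip mutation
                child[rng.randrange(n_bits)] ^= 1
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)
```

With a simple additive fitness (one bit per "effective" tactic), the population rapidly concentrates on high-scoring tactic combinations without enumerating all $2^{n}$ encodings.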
Empirical Impact
On HarmBench and StrongReject, Mastermind-Dou achieved attack success rates (ASR) of 67% (Claude 3.7 Sonnet) and 60% (GPT-5), outperforming X-Teaming and maintaining robustness even under advanced LLM defenses. Harmfulness ratings (HR) were also highest among tested baselines (Li et al., 9 Jan 2026).
2. Mastermind-Dou as the LLM-Based Doudizhu Agent
In LLM-empowered decision-making, Mastermind-Dou is a specialized agent for the 3-player imperfect-information card game Doudizhu. It combines algorithmic data synthesis with multi-head LLM finetuning to match or surpass state-of-the-art RL and rule-based agents (Wang et al., 18 Mar 2025).
Data Synthesis Pipeline
- Expert Trajectory Generation: Synthetic state-action trajectories are generated with three expert agents: RLCard’s rule-based policy, a supervised human-data mimic, and DouZero (Q-learning expert).
- Top-$p$ Filtering: At each state $s$, legal actions are scored via DouZero's Q-network and filtered to the minimal set whose cumulative probability reaches a threshold $p$, restricting the action-prediction space.
- Imperfect-Information Modeling: For each candidate move $a$, the downstream responses of the next two agents are recorded, and an opponent-strategy prediction head is trained on them with cross-entropy.
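The top-$p$ filtering step above can be sketched as follows. This is a plain softmax-based reimplementation for illustration, not DouZero's actual scoring interface:

```python
import math

def top_p_actions(q_values, p=0.25):
    """Keep the smallest set of actions whose softmax mass reaches p.

    q_values: dict mapping action name -> Q-value (illustrative stand-in
    for DouZero's network output)."""
    # numerically stable softmax over Q-values
    m = max(q_values.values())
    exp_q = {a: math.exp(q - m) for a, q in q_values.items()}
    z = sum(exp_q.values())
    probs = {a: e / z for a, e in exp_q.items()}
    # greedily take highest-probability actions until mass >= p
    kept, mass = [], 0.0
    for a, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(a)
        mass += pr
        if mass >= p:
            break
    return kept
```

With a low threshold such as $p = 0.25$, a single dominant action often survives, which is exactly the restriction of the prediction space the pipeline relies on.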
Model and Training
- Base: LLaMA-2-7B, fine-tuned with LoRA (rank 32) on 8×A100 GPUs.
- Input Encoding: cards are mapped to integers, with the Jokers taking the largest values (up to $30$); action lists are sorted integer-encoded vectors.
- Heads: (1) Possible Action Prediction (ranking the next action, token by token), (2) Opponent Strategy Prediction (a linear layer producing a probability distribution over opponent responses).
- Loss: a weighted combination of the token-level action-prediction cross-entropy and the opponent-strategy cross-entropy.
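A loss of this two-head shape can be sketched as follows; the weight `lam` is an illustrative assumption, not a value reported in the paper:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class index."""
    return -math.log(probs[target])

def two_head_loss(action_probs, action_target,
                  opp_probs, opp_target, lam=0.5):
    """Weighted sum of the action-prediction and opponent-strategy
    cross-entropy losses; lam is an illustrative weight."""
    l_act = cross_entropy(action_probs, action_target)
    l_opp = cross_entropy(opp_probs, opp_target)
    return l_act + lam * l_opp
```

In the actual model the action term is computed token by token over the generated action string; the sketch collapses that to a single categorical distribution for clarity.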
Empirical Results
Mastermind-Dou with the probability chain outperforms strong baselines and non-expert LLMs by a wide margin. Action accuracy reaches 90%, with win rates as landlord versus RLCard and DouZero of 90% and 41% respectively, matching DouZero's expert performance (Wang et al., 18 Mar 2025).
Table: Mastermind-Dou Key Results (Excerpt of Table 2, (Wang et al., 18 Mar 2025))
| Model | RLCard Win Rate | DouZero Win Rate |
|---|---|---|
| Mastermind-Dou with prob | 90% | 41% |
| DouZero (expert) | 90% | 43% |
| LLaMA-2-7B (few-shot+sim) | 12% | 3% |
Additionally, post-training on Doudizhu data yielded improved performance on BIG-Bench Hard reasoning tasks, though some catastrophic forgetting appeared on spatial/date subdomains.
3. Mastermind-Dou and Query Complexity in Black-Peg Mastermind
In combinatorial search, “Mastermind-Dou” (as an Editor's term) refers to the solution of Mastermind with $k = n$ (number of colors equal to code length) using black-peg queries, resolving an open efficiency gap (Martinsson et al., 2020).
Problem Statement
In $k$-color, $n$-position black-peg Mastermind, the codemaker picks a secret $x \in [k]^n$; the codebreaker adaptively submits queries $q \in [k]^n$, receiving as feedback the number of positions in which $q$ agrees with $x$.
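For concreteness, the black-peg feedback function is simply a count of positionwise agreements:

```python
def black_pegs(query, secret):
    """Black-peg feedback: number of positions where query and secret agree."""
    return sum(q == s for q, s in zip(query, secret))
```

Note that black-peg feedback carries strictly less information than the black-and-white variant, which additionally reports colors that are present but misplaced.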
Main Result
For $k = n$, there exists a randomized algorithm recovering $x$ in $O(n)$ queries, tight up to constants by the entropy lower bound (each query reveals at most $\log_2(n+1)$ bits, while the secret carries $\log_2(n^n) = n \log_2 n$ bits, yielding an $\Omega(n)$ lower bound).
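A quick numeric check of this counting argument: with $n^n$ possible secrets and at most $n + 1$ distinct answers per query, any strategy needs at least $n \log_2 n / \log_2(n+1) = \Theta(n)$ queries.

```python
import math

def entropy_lower_bound(n):
    """Counting lower bound on black-peg queries for k = n:
    n*log2(n) secret bits divided by log2(n+1) bits per answer."""
    return n * math.log2(n) / math.log2(n + 1)

# The ratio bound(n)/n tends to 1, so the lower bound is ~n queries.
```

For example, at $n = 1024$ the bound already exceeds $1000$ queries, so the $O(n)$ algorithm is optimal up to a constant factor.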
Algorithmic Outline
The key innovation is reducing to a “signed-permutation” Mastermind, where the secret is a permutation and queries can freely set positive/negative markers. The core algorithm uses an “information-tree token-sliding” method:
- Binary-tree token sliding: Encode the code positions as leaves of a complete binary tree. For each color, use a “token” propagated from the root to leaves, with queries partitioning at each node to localize the exact position.
- Query Compression: A two-phase approach (preprocessing and solve) partitions and compresses the search; at each recursive step, three independent queries are collapsed into two via a Cantor–Mills–style linear combination, ensuring $O(n)$ total complexity.
- Key Lemmas: Existential results for “zero” and “distinct-one” queries (for blanking and uniquely identifying colors), as well as query-combining lemma allowing parallel disjoint query resolution.
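A simplified illustration of the token-sliding idea is binary-searching a single color's position. The sketch below assumes, for illustration only, an extra "blank" symbol that never matches the secret; the actual algorithm constructs effective blanks via its zero-query lemma and achieves $O(n)$ total rather than the $O(\log n)$-per-color cost shown here.

```python
def locate_color(color, n, oracle, blank=None):
    """Binary-search the position of `color` in a secret permutation
    using black-peg queries padded with a non-matching `blank` symbol.

    `oracle(query)` returns the number of positions where the query
    agrees with the secret. This is the per-color sliding step only,
    not the paper's full compressed algorithm."""
    lo, hi = 0, n                     # candidate interval [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        query = [blank] * n           # blanks contribute zero pegs
        for i in range(lo, mid):
            query[i] = color          # probe the left half
        if oracle(query) == 1:        # color lies in the left half
            hi = mid
        else:                         # otherwise it is in the right half
            lo = mid
    return lo
```

Running this for every color costs $O(n \log n)$ queries; the binary-tree token sliding and query compression of the full algorithm amortize these searches down to $O(n)$.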
Generalization
Extending to arbitrary $k$, the randomized query complexity is:
- $\Theta(n \log k / \log n + k/n)$ (black-white peg)
- $\Theta(n \log k / \log n + k)$ (black-only)

These results synthesize previous bounds [Chvátal 1983, Doerr et al. 2016].
4. Cross-Domain Methodological Parallels
While the three usages of Mastermind-Dou target unrelated problems, common patterns can be abstracted:
- Hierarchical/Recursive Planning: All apply multi-level planning or recursive task decomposition—binary tree token sliding, high-level/low-level adversarial planning, or multi-stage Doudizhu move selection.
- Combining Information Efficiently: Exploiting the informational content of each action (query, prompt, or move) and adaptively focusing resources via probability mass, reflection, or tree-partitioning.
- Closed-Loop Feedback: Each system (query complexity, LLM game reasoning, adversarial jailbreaks) incorporates feedback—either via information-theoretic bounds, loss surfaces, or explicit success/failure scoring—into iterative refinement.
5. Impact and Benchmarking
Mastermind-Dou establishes new benchmarks across all three domains:
- Combinatorial Search: First linear-query complexity for Mastermind, closing a decades-old open gap and yielding tight bounds for arbitrary parameter regimes (Martinsson et al., 2020).
- LLM Game Competency: Matching RL experts in Doudizhu action accuracy and win-rate, validating algorithmic data synthesis as a paradigm for LLM deployment in imperfect-information games (Wang et al., 18 Mar 2025).
- Jailbreak Adversariality: State-of-the-art attack effectiveness on LLMs under advanced defenses, generalizing across open and closed-source targets and outperforming strong baselines (Li et al., 9 Jan 2026).
6. Technical Case Studies and Pseudocode
Doudizhu LLM Pipeline Skeleton (Wang et al., 18 Mar 2025):
```python
for trajectory in expert_games:
    for s in trajectory:
        A_legal = all_legal_moves(s)
        pi_Q = softmax(Q(s, a) for a in A_legal)
        A_p = top_p(pi_Q, threshold=0.25)
        # Stage 1: Action prediction
        prompt1 = {..., actions=A_p}
        teach_LLM_action_prediction(prompt1)
        for a in A_p:
            # Stage 2: Opponent prediction
            prompt2 = augment(prompt1, a)
            teach_OppHead_prediction(prompt2)
        # Stage 3: Action selection
        prompt3 = compose(prompt1, all_opp_predictions)
        teach_LLM_final_action(prompt3)
```
Mastermind-Dou Planning Loop (Li et al., 9 Jan 2026):
```python
def PLAN(q_harm, S_ret):
    H = retrieve_high_level_objectives(q_harm, S_ret)
    T = retrieve_low_level_tactics(q_harm, S_ret)
    P = []
    for eta in H:
        T_eta = select_tactics(T, eta)
        P.append((eta, T_eta))
    return P
```
7. References
- (Martinsson et al., 2020) "Mastermind with a Linear Number of Queries" (Martinsson & Su), query complexity for Mastermind.
- (Wang et al., 18 Mar 2025) "Empowering LLMs in Decision Games through Algorithmic Data Synthesis," details the Doudizhu LLM agent architecture and performance.
- (Li et al., 9 Jan 2026) "Knowledge-Driven Multi-Turn Jailbreaking on LLMs," describes Mastermind-Dou for adversarial LLM exploitation.
A plausible implication is that the Mastermind-Dou naming convention will persist as a marker for technically sophisticated, feedback-driven, and adversarially optimized agents in combinatorial, game-theoretic, and red-teaming domains.