
Mastermind-Dou: Multi-Domain Strategies

Updated 5 February 2026
  • Mastermind-Dou is a multifaceted construct that integrates adversarial jailbreak frameworks, LLM-based game decision making for Doudizhu, and linear-query algorithms for efficient Mastermind solving.
  • In its adversarial jailbreak application, it employs multi-turn, hierarchical planning and feedback loops to achieve high attack success rates against leading LLM defenses.
  • As a game agent and combinatorial solver, Mastermind-Dou leverages expert trajectory synthesis and binary-tree token sliding to reach up to 90% action accuracy and O(n) query efficiency.

Mastermind-Dou encompasses a set of technically distinct but nomenclaturally related constructs at the intersection of combinatorial search, adversarial language modeling, and game-theoretic deep learning. The term “Mastermind-Dou” appears in three principal domains: (1) as the codename for an LLM-based Doudizhu card game agent, (2) as a designation for a sharply optimal algorithm in black-peg Mastermind where the alphabet and code length coincide, and (3) as an instantiation of a self-improving, multi-turn jailbreak framework for LLMs. These usages exhibit no historical linkage but share a methodological emphasis on planning in adversarial or imperfect information environments.

1. Mastermind-Dou in Adversarial Jailbreaking of LLMs

Mastermind-Dou serves as an advanced, knowledge-driven multi-turn jailbreak agent, engineered for maximally effective evasion of state-of-the-art LLM defenses and the controlled induction of harmful outputs (Li et al., 9 Jan 2026). The framework operationalizes adversarial red teaming as a multi-turn, closed-loop Markovian process over conversation histories.

Formal Structure

  • State Definition: At turn $t$, the state is $s_t = (H_t, q_{\mathrm{harm}}, O)$, with $H_t$ denoting the sequence of user–assistant pairs, $q_{\mathrm{harm}}$ the harmful seed query, and $O$ the target objective.
  • Planning: A Planner $P$ outputs a multi-step plan $\mathcal{P} = (\pi_1, \dots, \pi_M)$, each $\pi_i$ representing a high-level adversarial sub-goal (e.g., persona adoption, masking intent).
  • Execution: An Executor $E$ generates the current prompt $u_t = E(H_{t-1}, \pi_{c(t)})$.
  • Control and Success Evaluation: A Controller $C$ determines whether response $r_t$ advances $\pi_{c(t)}$, refining or aborting as necessary. Success is declared when the judge score $J_t$ surpasses a threshold.
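The closed-loop interplay of these components can be sketched as follows. This is a hypothetical reconstruction from the formal structure above, not a released implementation; all function names and signatures (`planner`, `executor`, `controller`, `judge`, `target_llm`) are illustrative stand-ins.

```python
# Hypothetical sketch of the multi-turn jailbreak loop: plan, execute,
# evaluate, and either advance, refine, or replan. All callables are
# assumed interfaces, not the paper's actual code.

def run_attack(q_harm, objective, planner, executor, controller, judge,
               target_llm, max_turns=10, threshold=0.8):
    history = []                       # H_t: user-assistant pairs
    plan = planner(q_harm, objective)  # (pi_1, ..., pi_M)
    step = 0
    for _ in range(max_turns):
        prompt = executor(history, plan[step])  # u_t = E(H_{t-1}, pi_c(t))
        response = target_llm(prompt)
        history.append((prompt, response))
        if judge(response, objective) >= threshold:
            return history                       # success: J_t over threshold
        verdict = controller(response, plan[step])
        if verdict == "advance":
            step = min(step + 1, len(plan) - 1)
        elif verdict == "abort":
            plan = planner(q_harm, objective)    # replan from scratch
            step = 0
        # a "refine" verdict keeps the current sub-goal for another turn
    return history
```

The loop mirrors the state definition: each iteration extends $H_t$, and the Controller's verdict determines whether the plan index $c(t)$ advances.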

Hierarchical Planning and Knowledge Integration

  • Hierarchical Milestones: High-level objectives $H = \{\eta_1, \dots\}$ and low-level tactics $T = \{\tau_1, \dots\}$ jointly optimize $\mathcal{P}$ via a loss that balances objective alignment ($\ell_{\mathrm{obj}}$) and coherence ($\ell_{\mathrm{coh}}$).
  • Repository Formalism: Mastermind-Dou continually refines a knowledge repository $K$ of reusable adversarial patterns, updated via feedback-driven extraction and pruning.

Closed-Loop Adaptation

Reflection $R$ remediates failed plans by optimizing for minimal redo errors and preservation of successful priors. Dynamic recombination uses a binary encoding of tactics and evolutionary operators (crossover, mutation, and selection proportional to vulnerability-oracle feedback) to efficiently navigate the combinatorics of tactic combinations.
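The evolutionary recombination step can be illustrated with a toy genetic algorithm over tactic bitstrings. The oracle, population sizes, and operators below are illustrative assumptions; the source does not specify these hyperparameters.

```python
import random

# Toy sketch of dynamic recombination: a tactic combination is a binary
# string (bit i = tactic i active), evolved under crossover, bit-flip
# mutation, and fitness-proportional selection against a vulnerability
# oracle. All parameters are hypothetical.

def evolve_tactics(oracle, n_tactics=8, pop_size=20, generations=10,
                   mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_tactics)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [oracle(ind) for ind in pop]
        total = sum(scores) or 1.0

        def pick():
            # selection proportional to oracle feedback (roulette wheel)
            r = rng.uniform(0, total)
            for ind, s in zip(pop, scores):
                r -= s
                if r <= 0:
                    return ind
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            a, b = pick(), pick()
            cut = rng.randrange(1, n_tactics)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                   # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=oracle)

# toy oracle: rewards combinations that activate tactics 0 and 3 together
best = evolve_tactics(lambda t: 1.0 + t[0] + t[3])
```

Under selection pressure the population concentrates on high-feedback tactic combinations, which is the mechanism the framework uses to navigate the exponential space of tactic subsets.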

Empirical Impact

On HarmBench and StrongReject, Mastermind-Dou achieved attack success rates (ASR) of 67% (Claude 3.7 Sonnet) and 60% (GPT-5), outperforming X-Teaming and maintaining robustness even under advanced LLM defenses. Harmfulness ratings (HR) were also highest among tested baselines (Li et al., 9 Jan 2026).

2. Mastermind-Dou as the LLM-Based Doudizhu Agent

In LLM-empowered decision-making, Mastermind-Dou is a specialized agent for the 3-player imperfect-information card game Doudizhu. It combines algorithmic data synthesis with multi-head LLM finetuning to match or surpass state-of-the-art RL and rule-based agents (Wang et al., 18 Mar 2025).

Data Synthesis Pipeline

  • Expert Trajectory Generation: Synthetic state–action trajectories are generated with three expert agents: RLCard’s rule-based policy, a supervised human-data mimic, and DouZero (a Q-learning expert).
  • Top-$p$ Filtering: At each state $s$, actions are scored via DouZero’s $Q(s,a)$ network and filtered to the minimal set $A_p$ covering cumulative probability $\ge 0.25$, restricting the action prediction space.
  • Imperfect-Information Modeling: For each candidate move $a$, the downstream responses of the next two agents are recorded, introducing an opponent-strategy prediction head $\hat\pi_{\mathrm{opp}}(a' \mid s, a)$ trained with cross-entropy.
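The top-$p$ filtering step above can be sketched directly: softmax the Q-values, then keep the smallest prefix of actions covering cumulative probability $p$. The Q-values in the example are made up for illustration, not DouZero outputs.

```python
import math

# Minimal sketch of top-p filtering over Q-scored actions: softmax the
# Q-values and keep the smallest action set whose cumulative probability
# reaches the threshold p (0.25 in the paper's pipeline).

def top_p_actions(q_values, p=0.25):
    """q_values: dict mapping action -> Q(s, a). Returns minimal set A_p."""
    exps = {a: math.exp(q) for a, q in q_values.items()}
    z = sum(exps.values())
    ranked = sorted(((e / z, a) for a, e in exps.items()), reverse=True)
    kept, cum = [], 0.0
    for prob, action in ranked:
        kept.append(action)
        cum += prob
        if cum >= p:
            break
    return kept

# A single dominant action can already cover p = 0.25:
top_p_actions({"pass": -1.0, "3": 0.5, "33": 2.0})  # -> ["33"]
```

Raising $p$ admits more candidates, trading a larger prediction space for better action coverage.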

Model and Training

  • Base: LLaMA-2-7B with LoRA (rank 32, $\alpha = 64$), trained on 8×A100 GPUs.
  • Input Encoding: Cards as integers (e.g., $3 \rightarrow 3$, $2 \rightarrow 17$, Jokers up to $30$); action lists are sorted integer-encoded vectors.
  • Heads: (1) Possible Action Prediction (ranking the next action, token by token), (2) Opponent Strategy Prediction (a linear layer over $a'$ probabilities).
  • Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{SFT}} + \lambda \mathcal{L}_{\mathrm{opp}}$ ($\lambda = 1$), where

$$\mathcal{L}_{\mathrm{SFT}} = -\frac{1}{N} \sum_{i=1}^N \log p_\theta(Y_i \mid X_i), \qquad \mathcal{L}_{\mathrm{opp}} = -\frac{1}{M} \sum_{j=1}^M \sum_{a'} \mathbf{1}[a'_j = a'] \log \hat\pi_{\mathrm{opp}}(a' \mid s_j, a_j)$$
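On toy values, the combined objective is just two averaged negative log-likelihoods. The log-probabilities and opponent distributions below are invented for illustration; a real implementation would compute them from model logits.

```python
import math

# Toy computation of L = L_SFT + lambda * L_opp on made-up values.

def sft_loss(log_probs):
    # L_SFT: mean negative log-likelihood of target sequences Y_i given X_i
    return -sum(log_probs) / len(log_probs)

def opp_loss(pred_dists, true_actions):
    # L_opp: cross-entropy of the opponent head against observed moves a'_j
    return -sum(math.log(dist[a])
                for dist, a in zip(pred_dists, true_actions)) / len(true_actions)

lam = 1.0  # the paper sets lambda = 1
total = sft_loss([-0.5, -1.0]) + lam * opp_loss(
    [{"pass": 0.7, "3": 0.3}], ["pass"])
```

With $\lambda = 1$ the two heads contribute equally; the indicator sum in $\mathcal{L}_{\mathrm{opp}}$ reduces to picking out the log-probability of the observed opponent move, as `opp_loss` does.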

Empirical Results

Mastermind-Dou with the probability chain outperforms strong baselines and non-expert LLMs by a wide margin. Action accuracy reaches 90%, with win rates as landlord versus RLCard and DouZero of 90% and 41% respectively, matching DouZero’s expert performance (Wang et al., 18 Mar 2025).

Table: Mastermind-Dou Key Results (Excerpt of Table 2; Wang et al., 18 Mar 2025)

| Model | RLCard Win Rate | DouZero Win Rate |
|---|---|---|
| Mastermind-Dou (with prob.) | 90% | 41% |
| DouZero (expert) | 90% | 43% |
| LLaMA-2-7B (few-shot + sim) | 12% | 3% |

Additionally, post-training on Doudizhu data yielded improved performance on BIG-Bench Hard reasoning tasks, though some catastrophic forgetting appeared on spatial/date subdomains.

3. Mastermind-Dou and Query Complexity in Black-Peg Mastermind

In combinatorial search, “Mastermind-Dou” is used (as an editorial umbrella term) for the solution of Mastermind with $k = n$ using $O(n)$ black-peg queries, resolving an open efficiency gap (Martinsson et al., 2020).

Problem Statement

In $k$-color, $n$-position black-peg Mastermind, the codemaker picks $c \in [k]^n$; the codebreaker queries $q \in [k]^n$, receiving $b_c(q) = |\{i : q_i = c_i\}|$.
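The feedback function $b_c(q)$ counts exact positional matches and is trivial to implement; the example codes below are arbitrary.

```python
# Black-peg feedback b_c(q): the number of positions where the query
# agrees exactly with the secret code.

def black_pegs(code, query):
    return sum(c == q for c, q in zip(code, query))

# n = k = 4 example: the query agrees with the code at positions 0 and 2
black_pegs([1, 3, 2, 4], [1, 2, 2, 3])  # -> 2
```

Unlike the classic black-white variant, no credit is given for correct colors in wrong positions, which is what makes the $O(n)$ result for $k = n$ nontrivial.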

Main Result

For $k = n$, there exists a randomized algorithm recovering $c \in [n]^n$ in $O(n)$ queries, which is tight by the entropy lower bound: each query leaks $O(\log n)$ bits against $n \log n$ bits of total information, yielding an $\Omega(n)$ lower bound.
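The counting argument behind the $\Omega(n)$ bound can be made explicit:

$$\log_2 |[n]^n| = n \log_2 n \ \text{bits identify } c, \qquad b_c(q) \in \{0, \dots, n\} \ \text{yields at most } \log_2(n+1) \ \text{bits per query},$$

so any strategy, randomized or not, needs at least $n \log_2 n / \log_2(n+1) = \Omega(n)$ queries in expectation, matching the algorithm's $O(n)$ upper bound.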

Algorithmic Outline

The key innovation is a reduction to a “signed-permutation” Mastermind, where the secret is a permutation and queries can freely set positive/negative markers. The core algorithm uses an “information-tree token-sliding” method:

  • Binary-tree token sliding: Encode the $n$ code positions as leaves of a complete binary tree. For each color, a “token” is propagated from the root to the leaves, with queries partitioning at each node to localize the exact position.
  • Query Compression: A two-phase approach (preprocessing and solve) partitions and compresses the search; at each recursive step, three independent queries are collapsed into two via a Cantor–Mills-style linear combination, ensuring $O(n)$ total complexity.
  • Key Lemmas: Existential results for “zero” and “distinct-one” queries (for blanking and uniquely identifying colors), as well as a query-combining lemma allowing parallel resolution of disjoint queries.

Generalization

Extending to arbitrary $k, n$, the randomized query complexity is:

  • $\mathrm{bwmm}(n, k) = \Theta(n \log k / \log n + k/n)$ (black-white peg)
  • $\mathrm{bmm}(n, k) = \Theta(n \log k / \log n + k)$ (black-only)

These results synthesize previous bounds [Chvátal 1983, Doerr et al. 2016].

4. Cross-Domain Methodological Parallels

While the three usages of Mastermind-Dou target unrelated problems, common patterns can be abstracted:

  • Hierarchical/Recursive Planning: All apply multi-level planning or recursive task decomposition—binary tree token sliding, high-level/low-level adversarial planning, or multi-stage Doudizhu move selection.
  • Combining Information Efficiently: Exploiting the informational content of each action (query, prompt, or move) and adaptively focusing resources via probability mass, reflection, or tree-partitioning.
  • Closed-Loop Feedback: Each system (query complexity, LLM game reasoning, adversarial jailbreaks) incorporates feedback—either via information-theoretic bounds, loss surfaces, or explicit success/failure scoring—into iterative refinement.

5. Impact and Benchmarking

Mastermind-Dou establishes new benchmarks across all three domains:

  • Combinatorial Search: First linear-query complexity for $k = n$ Mastermind, closing a decades-old open gap and yielding tight bounds for arbitrary parameter regimes (Martinsson et al., 2020).
  • LLM Game Competency: Matching RL experts in Doudizhu action accuracy and win-rate, validating algorithmic data synthesis as a paradigm for LLM deployment in imperfect-information games (Wang et al., 18 Mar 2025).
  • Jailbreak Adversariality: State-of-the-art attack effectiveness on LLMs under advanced defenses, generalizing across open and closed-source targets and outperforming strong baselines (Li et al., 9 Jan 2026).

6. Technical Case Studies and Pseudocode

Doudizhu LLM Pipeline Skeleton (Wang et al., 18 Mar 2025):

for each trajectory in expert_games:
    for state s in trajectory:
        A_legal  = all_legal_moves(s)
        pi_Q     = softmax(Q(s,a) for a in A_legal)
        A_p      = top_p(pi_Q, threshold=0.25)
        # Stage 1: Action prediction
        prompt1 = {..., actions=A_p}
        teach_LLM_action_prediction(prompt1)
        for a in A_p:
            # Stage 2: Opponent prediction
            prompt2 = augment(prompt1, a)
            teach_OppHead_prediction(prompt2)
        # Stage 3: Action selection
        prompt3 = compose(prompt1, all_opp_predictions)
        teach_LLM_final_action(prompt3)

Mastermind-Dou Planning Loop (Li et al., 9 Jan 2026):

def PLAN(q_harm, S_ret):
    H = retrieve_high_level_objectives(q_harm, S_ret)
    T = retrieve_low_level_tactics(q_harm, S_ret)
    P = []
    for eta in H:
        T_eta = select_tactics(T, eta)
        P.append((eta, T_eta))
    return P

7. References

  • (Martinsson et al., 2020) "Mastermind with a Linear Number of Queries" (Martinsson & Su), query complexity for k=nk=n Mastermind.
  • (Wang et al., 18 Mar 2025) "Empowering LLMs in Decision Games through Algorithmic Data Synthesis," details the Doudizhu LLM agent architecture and performance.
  • (Li et al., 9 Jan 2026) "Knowledge-Driven Multi-Turn Jailbreaking on LLMs," describes Mastermind-Dou for adversarial LLM exploitation.

A plausible implication is that the Mastermind-Dou naming convention will persist as a marker for technically sophisticated, feedback-driven, and adversarially optimized agents in combinatorial, game-theoretic, and red-teaming domains.
