
Mastermind-Dou: Multi-Domain Strategies

Updated 5 February 2026
  • Mastermind-Dou is a multifaceted construct that integrates adversarial jailbreak frameworks, LLM-based game decision making for Doudizhu, and linear-query algorithms for efficient Mastermind solving.
  • In its adversarial jailbreak application, it employs multi-turn, hierarchical planning and feedback loops to achieve high attack success rates against leading LLM defenses.
  • As a game agent and combinatorial solver, Mastermind-Dou leverages expert trajectory synthesis and binary-tree token sliding to reach up to 90% action accuracy and O(n) query efficiency.

Mastermind-Dou encompasses a set of technically distinct but nomenclaturally related constructs at the intersection of combinatorial search, adversarial language modeling, and game-theoretic deep learning. The term “Mastermind-Dou” appears in three principal domains: (1) as the codename for an LLM-based Doudizhu card game agent, (2) as a designation for a sharply optimal algorithm in black-peg Mastermind where the alphabet and code length coincide, and (3) as an instantiation of a self-improving, multi-turn jailbreak framework for LLMs. These usages exhibit no historical linkage but share a methodological emphasis on planning in adversarial or imperfect information environments.

1. Mastermind-Dou in Adversarial Jailbreaking of LLMs

Mastermind-Dou serves as an advanced, knowledge-driven multi-turn jailbreak agent, engineered for maximally effective evasion of state-of-the-art LLM defenses and the controlled induction of harmful outputs (Li et al., 9 Jan 2026). The framework operationalizes adversarial red teaming as a multi-turn, closed-loop Markovian process over conversation histories.

Formal Structure

  • State Definition: At turn $t$, the state is $s_t = (H_t, q_{\mathrm{harm}}, O)$, with $H_t$ denoting the sequence of user–assistant pairs, $q_{\mathrm{harm}}$ the harmful seed query, and $O$ the target objective.
  • Planning: A Planner $P$ outputs a multi-step plan $\mathcal{P} = (\pi_1, \dots, \pi_M)$, each $\pi_i$ representing a high-level adversarial sub-goal (e.g., persona adoption, masking intent).
  • Execution: An Executor $E$ generates the current prompt $u_t = E(H_{t-1}, \pi_{c(t)})$.
  • Control and Success Evaluation: A Controller $C$ determines whether response $r_t$ advances $\pi_{c(t)}$, refining or aborting as necessary. Success is declared when the judge score $J_t$ surpasses a threshold.
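The closed-loop interplay of these components can be sketched as follows. This is a hypothetical reconstruction from the formal structure above, not a released implementation; all function names and signatures (`planner`, `executor`, `controller`, `judge`, `target_llm`) are illustrative stand-ins.

```python
# Hypothetical sketch of the multi-turn jailbreak loop: plan, execute,
# evaluate, and either advance, refine, or replan. All callables are
# assumed interfaces, not the paper's actual code.

def run_attack(q_harm, objective, planner, executor, controller, judge,
               target_llm, max_turns=10, threshold=0.8):
    history = []                       # H_t: user-assistant pairs
    plan = planner(q_harm, objective)  # (pi_1, ..., pi_M)
    step = 0
    for _ in range(max_turns):
        prompt = executor(history, plan[step])  # u_t = E(H_{t-1}, pi_c(t))
        response = target_llm(prompt)
        history.append((prompt, response))
        if judge(response, objective) >= threshold:
            return history                       # success: J_t over threshold
        verdict = controller(response, plan[step])
        if verdict == "advance":
            step = min(step + 1, len(plan) - 1)
        elif verdict == "abort":
            plan = planner(q_harm, objective)    # replan from scratch
            step = 0
        # a "refine" verdict keeps the current sub-goal for another turn
    return history
```

The loop mirrors the state definition: each iteration extends $H_t$, and the Controller's verdict determines whether the plan index $c(t)$ advances.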

Hierarchical Planning and Knowledge Integration

  • Hierarchical Milestones: High-level objectives $H = \{\eta_1, \dots\}$ and low-level tactics $T = \{\tau_1, \dots\}$ jointly optimize $\mathcal{P}$ via a loss that balances objective alignment ($\ell_{\mathrm{obj}}$) and coherence ($\ell_{\mathrm{coh}}$).
  • Repository Formalism: Mastermind-Dou continually refines a knowledge repository $K$ of reusable adversarial patterns, updated via feedback-driven extraction and pruning.

Closed-Loop Adaptation

Reflection $R$ remediates failed plans by optimizing for minimal redo errors and preservation of successful priors. Dynamic recombination uses a binary encoding of tactics and evolutionary operators (crossover, mutation, and selection proportional to vulnerability-oracle feedback) to efficiently navigate the combinatorics of tactic combinations.
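The evolutionary recombination step can be illustrated with a toy genetic algorithm over tactic bitstrings. The oracle, population sizes, and operators below are illustrative assumptions; the source does not specify these hyperparameters.

```python
import random

# Toy sketch of dynamic recombination: a tactic combination is a binary
# string (bit i = tactic i active), evolved under crossover, bit-flip
# mutation, and fitness-proportional selection against a vulnerability
# oracle. All parameters are hypothetical.

def evolve_tactics(oracle, n_tactics=8, pop_size=20, generations=10,
                   mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_tactics)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [oracle(ind) for ind in pop]
        total = sum(scores) or 1.0

        def pick():
            # selection proportional to oracle feedback (roulette wheel)
            r = rng.uniform(0, total)
            for ind, s in zip(pop, scores):
                r -= s
                if r <= 0:
                    return ind
            return pop[-1]

        nxt = []
        while len(nxt) < pop_size:
            a, b = pick(), pick()
            cut = rng.randrange(1, n_tactics)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                   # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=oracle)

# toy oracle: rewards combinations that activate tactics 0 and 3 together
best = evolve_tactics(lambda t: 1.0 + t[0] + t[3])
```

Under selection pressure the population concentrates on high-feedback tactic combinations, which is the mechanism the framework uses to navigate the exponential space of tactic subsets.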

Empirical Impact

On HarmBench and StrongReject, Mastermind-Dou achieved attack success rates (ASR) of 67% (Claude 3.7 Sonnet) and 60% (GPT-5), outperforming X-Teaming and maintaining robustness even under advanced LLM defenses. Harmfulness ratings (HR) were also highest among tested baselines (Li et al., 9 Jan 2026).

2. Mastermind-Dou as the LLM-Based Doudizhu Agent

In LLM-empowered decision-making, Mastermind-Dou is a specialized agent for the 3-player imperfect-information card game Doudizhu. It combines algorithmic data synthesis with multi-head LLM finetuning to match or surpass state-of-the-art RL and rule-based agents (Wang et al., 18 Mar 2025).

Data Synthesis Pipeline

  • Expert Trajectory Generation: Synthetic state–action trajectories are generated with three expert agents: RLCard’s rule-based policy, a supervised human-data mimic, and DouZero (a Q-learning expert).
  • Top-$p$ Filtering: At each state $s$, actions are scored via DouZero’s $Q(s,a)$ network and filtered to the minimal set $A_p$ covering cumulative probability $\ge 0.25$, restricting the action prediction space.
  • Imperfect-Information Modeling: For each candidate move $a$, the downstream responses of the next two agents are recorded, introducing an opponent-strategy prediction head $\hat\pi_{\mathrm{opp}}(a' \mid s, a)$ trained with cross-entropy.
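The top-$p$ filtering step above can be sketched directly: softmax the Q-values, then keep the smallest prefix of actions covering cumulative probability $p$. The Q-values in the example are made up for illustration, not DouZero outputs.

```python
import math

# Minimal sketch of top-p filtering over Q-scored actions: softmax the
# Q-values and keep the smallest action set whose cumulative probability
# reaches the threshold p (0.25 in the paper's pipeline).

def top_p_actions(q_values, p=0.25):
    """q_values: dict mapping action -> Q(s, a). Returns minimal set A_p."""
    exps = {a: math.exp(q) for a, q in q_values.items()}
    z = sum(exps.values())
    ranked = sorted(((e / z, a) for a, e in exps.items()), reverse=True)
    kept, cum = [], 0.0
    for prob, action in ranked:
        kept.append(action)
        cum += prob
        if cum >= p:
            break
    return kept

# A single dominant action can already cover p = 0.25:
top_p_actions({"pass": -1.0, "3": 0.5, "33": 2.0})  # -> ["33"]
```

Raising $p$ admits more candidates, trading a larger prediction space for better action coverage.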

Model and Training

  • Base: LLaMA-2-7B with LoRA (rank 32, $\alpha = 64$), trained on 8×A100 GPUs.
  • Input Encoding: Cards as integers (e.g., $3 \rightarrow 3$, $2 \rightarrow 17$, Jokers up to $30$); action lists are sorted integer-encoded vectors.
  • Heads: (1) Possible Action Prediction (ranking the next action, token by token), (2) Opponent Strategy Prediction (a linear layer over $a'$ probabilities).
  • Loss: $\mathcal{L} = \mathcal{L}_{\mathrm{SFT}} + \lambda \mathcal{L}_{\mathrm{opp}}$ ($\lambda = 1$), where

$$\mathcal{L}_{\mathrm{SFT}} = -\frac{1}{N} \sum_{i=1}^N \log p_\theta(Y_i \mid X_i), \qquad \mathcal{L}_{\mathrm{opp}} = -\frac{1}{M} \sum_{j=1}^M \sum_{a'} \mathbf{1}[a'_j = a'] \log \hat\pi_{\mathrm{opp}}(a' \mid s_j, a_j)$$
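On toy values, the combined objective is just two averaged negative log-likelihoods. The log-probabilities and opponent distributions below are invented for illustration; a real implementation would compute them from model logits.

```python
import math

# Toy computation of L = L_SFT + lambda * L_opp on made-up values.

def sft_loss(log_probs):
    # L_SFT: mean negative log-likelihood of target sequences Y_i given X_i
    return -sum(log_probs) / len(log_probs)

def opp_loss(pred_dists, true_actions):
    # L_opp: cross-entropy of the opponent head against observed moves a'_j
    return -sum(math.log(dist[a])
                for dist, a in zip(pred_dists, true_actions)) / len(true_actions)

lam = 1.0  # the paper sets lambda = 1
total = sft_loss([-0.5, -1.0]) + lam * opp_loss(
    [{"pass": 0.7, "3": 0.3}], ["pass"])
```

With $\lambda = 1$ the two heads contribute equally; the indicator sum in $\mathcal{L}_{\mathrm{opp}}$ reduces to picking out the log-probability of the observed opponent move, as `opp_loss` does.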

Empirical Results

Mastermind-Dou with the probability chain outperforms strong baselines and non-expert LLMs by a wide margin. Action accuracy reaches 90%, with win rates as landlord versus RLCard and DouZero of 90% and 41% respectively, matching DouZero’s expert performance (Wang et al., 18 Mar 2025).

Table: Mastermind-Dou Key Results (Excerpt of Table 2; Wang et al., 18 Mar 2025)

| Model | RLCard Win Rate | DouZero Win Rate |
|---|---|---|
| Mastermind-Dou (with prob.) | 90% | 41% |
| DouZero (expert) | 90% | 43% |
| LLaMA-2-7B (few-shot + sim) | 12% | 3% |

Additionally, post-training on Doudizhu data yielded improved performance on BIG-Bench Hard reasoning tasks, though some catastrophic forgetting appeared on spatial/date subdomains.

3. Mastermind-Dou and Query Complexity in Black-Peg Mastermind

In combinatorial search, “Mastermind-Dou” is used (as an editorial umbrella term) for the solution of Mastermind with $k = n$ using $O(n)$ black-peg queries, resolving an open efficiency gap (Martinsson et al., 2020).

Problem Statement

In $k$-color, $n$-position black-peg Mastermind, the codemaker picks $c \in [k]^n$; the codebreaker queries $q \in [k]^n$, receiving $b_c(q) = |\{i : q_i = c_i\}|$.
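The feedback function $b_c(q)$ counts exact positional matches and is trivial to implement; the example codes below are arbitrary.

```python
# Black-peg feedback b_c(q): the number of positions where the query
# agrees exactly with the secret code.

def black_pegs(code, query):
    return sum(c == q for c, q in zip(code, query))

# n = k = 4 example: the query agrees with the code at positions 0 and 2
black_pegs([1, 3, 2, 4], [1, 2, 2, 3])  # -> 2
```

Unlike the classic black-white variant, no credit is given for correct colors in wrong positions, which is what makes the $O(n)$ result for $k = n$ nontrivial.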

Main Result

For $k = n$, there exists a randomized algorithm recovering $c \in [n]^n$ in $O(n)$ queries, which is tight by the entropy lower bound: each query leaks $O(\log n)$ bits against $n \log n$ bits of total information, yielding an $\Omega(n)$ lower bound.
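The counting argument behind the $\Omega(n)$ bound can be made explicit:

$$\log_2 |[n]^n| = n \log_2 n \ \text{bits identify } c, \qquad b_c(q) \in \{0, \dots, n\} \ \text{yields at most } \log_2(n+1) \ \text{bits per query},$$

so any strategy, randomized or not, needs at least $n \log_2 n / \log_2(n+1) = \Omega(n)$ queries in expectation, matching the algorithm's $O(n)$ upper bound.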

Algorithmic Outline

The key innovation is a reduction to a “signed-permutation” Mastermind, where the secret is a permutation and queries can freely set positive/negative markers. The core algorithm uses an “information-tree token-sliding” method:

  • Binary-tree token sliding: Encode the $n$ code positions as leaves of a complete binary tree. For each color, a “token” is propagated from the root to the leaves, with queries partitioning at each node to localize the exact position.
  • Query Compression: A two-phase approach (preprocessing and solve) partitions and compresses the search; at each recursive step, three independent queries are collapsed into two via a Cantor–Mills-style linear combination, ensuring $O(n)$ total complexity.
  • Key Lemmas: Existential results for “zero” and “distinct-one” queries (for blanking and uniquely identifying colors), as well as a query-combining lemma allowing parallel resolution of disjoint queries.

Generalization

Extending to arbitrary $k, n$, the randomized query complexity is:

  • $\mathrm{bwmm}(n, k) = \Theta(n \log k / \log n + k/n)$ (black-white peg)
  • $\mathrm{bmm}(n, k) = \Theta(n \log k / \log n + k)$ (black-only)

These results synthesize previous bounds [Chvátal 1983, Doerr et al. 2016].

4. Cross-Domain Methodological Parallels

While the three usages of Mastermind-Dou target unrelated problems, common patterns can be abstracted:

  • Hierarchical/Recursive Planning: All apply multi-level planning or recursive task decomposition—binary tree token sliding, high-level/low-level adversarial planning, or multi-stage Doudizhu move selection.
  • Combining Information Efficiently: Exploiting the informational content of each action (query, prompt, or move) and adaptively focusing resources via probability mass, reflection, or tree-partitioning.
  • Closed-Loop Feedback: Each system (query complexity, LLM game reasoning, adversarial jailbreaks) incorporates feedback—either via information-theoretic bounds, loss surfaces, or explicit success/failure scoring—into iterative refinement.

5. Impact and Benchmarking

Mastermind-Dou establishes new benchmarks across all three domains:

  • Combinatorial Search: First linear-query complexity for $k = n$ Mastermind, closing a decades-old open gap and yielding tight bounds for arbitrary parameter regimes (Martinsson et al., 2020).
  • LLM Game Competency: Matching RL experts in Doudizhu action accuracy and win-rate, validating algorithmic data synthesis as a paradigm for LLM deployment in imperfect-information games (Wang et al., 18 Mar 2025).
  • Jailbreak Adversariality: State-of-the-art attack effectiveness on LLMs under advanced defenses, generalizing across open and closed-source targets and outperforming strong baselines (Li et al., 9 Jan 2026).

6. Technical Case Studies and Pseudocode

Doudizhu LLM Pipeline Skeleton (Wang et al., 18 Mar 2025):

for each trajectory in expert_games:
    for state s in trajectory:
        A_legal  = all_legal_moves(s)
        pi_Q     = softmax(Q(s,a) for a in A_legal)
        A_p      = top_p(pi_Q, threshold=0.25)
        # Stage 1: Action prediction
        prompt1 = {..., actions=A_p}
        teach_LLM_action_prediction(prompt1)
        for a in A_p:
            # Stage 2: Opponent prediction
            prompt2 = augment(prompt1, a)
            teach_OppHead_prediction(prompt2)
        # Stage 3: Action selection
        prompt3 = compose(prompt1, all_opp_predictions)
        teach_LLM_final_action(prompt3)

Mastermind-Dou Planning Loop (Li et al., 9 Jan 2026):

def PLAN(q_harm, S_ret):
    H = retrieve_high_level_objectives(q_harm, S_ret)
    T = retrieve_low_level_tactics(q_harm, S_ret)
    P = []
    for eta in H:
        T_eta = select_tactics(T, eta)
        P.append((eta, T_eta))
    return P

7. References

  • (Martinsson et al., 2020) "Mastermind with a Linear Number of Queries" (Martinsson & Su), query complexity for k=nk=n Mastermind.
  • (Wang et al., 18 Mar 2025) "Empowering LLMs in Decision Games through Algorithmic Data Synthesis," details the Doudizhu LLM agent architecture and performance.
  • (Li et al., 9 Jan 2026) "Knowledge-Driven Multi-Turn Jailbreaking on LLMs," describes Mastermind-Dou for adversarial LLM exploitation.

A plausible implication is that the Mastermind-Dou naming convention will persist as a marker for technically sophisticated, feedback-driven, and adversarially optimized agents in combinatorial, game-theoretic, and red-teaming domains.
