
Tree of Attacks with Pruning (TAP)

Updated 18 June 2026
  • Tree of Attacks with Pruning (TAP) is an automated adversarial prompting framework that uses breadth-first tree search and chain-of-thought refinements to generate effective jailbreak prompts for LLMs.
  • It systematically expands candidate prompts and prunes off-topic or low-scoring ones using evaluator LLMs, significantly reducing query requirements while boosting success rates.
  • Empirical results demonstrate TAP’s high efficiency, achieving up to 90% jailbreak success on models like GPT-4 with notably fewer queries compared to previous methods.

Tree of Attacks with Pruning (TAP) is an automated adversarial prompting framework designed to jailbreak LLMs using only black-box access. TAP systematically explores the prompt space via breadth-first tree search, injecting chain-of-thought candidate refinements, and applies targeted pruning to maximize efficiency and attack success rates. The TAP methodology and its derivatives have established new empirical state-of-the-art results for automated black-box jailbreaks, and provide a foundation for efficient red-teaming and adversarial evaluation of LLM safety mechanisms (Mehrotra et al., 2023).

1. Algorithmic Structure of the TAP Framework

TAP instantiates a breadth-first “tree-of-thought” attack search, where each node in the search tree corresponds to a candidate attack prompt or partial dialogue, and the edges correspond to prompt refinements or dialogue continuations supplied by an attacker LLM. For single-turn jailbreaks, TAP proceeds as follows:

  • Branch Expansion: At each tree depth $i$ (up to max depth $d$), each current leaf prompt is expanded into $b$ new prompts using an attacker LLM $A$, each representing a single-step chain-of-thought refinement.
  • Prune I (Off-topic Pruning): Candidate prompts are discarded if deemed off-topic relative to the original goal $G$, as determined by an evaluator LLM $E$.
  • Query and Assessment: Surviving candidates are sent to the target LLM $T$; responses $R$ are collected and scored via a Judge function (also implemented by $E$).
  • Termination and Prune II (Width Control): If a score indicates a successful jailbreak, the process halts with success. If not, and more than $w$ candidate leaves remain, only the $w$ highest-scoring leaves are retained.

The process repeats until a jailbreak is achieved or the maximum tree depth is reached. TAP's key parameters are the maximum tree depth $d$, the branching factor $b$, and the maximum width $w$; the principal experiments found $d = 10$, $b = 4$, $w = 10$ effective (Mehrotra et al., 2023).
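The four steps above can be sketched as a short search loop. This is a minimal illustration, not the authors' implementation: the callables `attacker`, `is_off_topic`, `target`, and `judge` are hypothetical stand-ins for the LLM calls, the root of the tree is assumed to be the goal string itself, and the numeric `success_score` threshold is an assumption about how the Judge signals success.

```python
from typing import Callable, List, Optional, Tuple


def tap_search(
    goal: str,
    attacker: Callable[[str, str, int], List[str]],  # (goal, leaf, b) -> refinements
    is_off_topic: Callable[[str, str], bool],        # (goal, prompt) -> off-topic?
    target: Callable[[str], str],                    # prompt -> response
    judge: Callable[[str, str], float],              # (goal, response) -> score
    depth: int,
    branching: int,
    width: int,
    success_score: float,
) -> Tuple[Optional[str], int]:
    """Return (successful prompt or None, number of target queries)."""
    leaves = [goal]  # assumption: the search starts from the goal text
    queries = 0
    for _ in range(depth):
        # Branch expansion: each leaf yields `branching` refinements.
        candidates = [c for leaf in leaves for c in attacker(goal, leaf, branching)]
        # Prune I: drop off-topic candidates before spending target queries.
        candidates = [c for c in candidates if not is_off_topic(goal, c)]
        # Query and assessment, with early termination on success.
        scored = []
        for prompt in candidates:
            response = target(prompt)
            queries += 1
            score = judge(goal, response)
            if score >= success_score:
                return prompt, queries
            scored.append((score, prompt))
        if not scored:
            break  # every candidate was pruned; nothing left to expand
        # Prune II: keep only the `width` highest-scoring leaves.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        leaves = [prompt for _, prompt in scored[:width]]
    return None, queries
```

Passing the four roles in as plain functions keeps the loop independent of any particular model API, so it can be exercised with deterministic toy stubs before any real model is attached.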

For multi-turn jailbreaks and complex dialogue settings, as in DialTree-RPO, TAP generalizes to dialogue trees: nodes correspond to interleaved attacker-target utterances, and branching, evaluation, and pruning proceed at each turn (Guo et al., 2 Oct 2025).

2. Candidate-Prompt and Dialogue Refinement

Prompt refinement within TAP is attacker-driven and informed by chain-of-thought analysis. For each node, the attacker LLM $A$ processes the conversation history and outputs a JSON object of the form:

  {"improvement": "...", "prompt": "..."}

The improvement field is a natural-language diagnosis of why prior prompts failed, while the prompt field is an evolved prompt crafted to evade safety filters. Candidate prompts seek to maximize expected jailbreak success, subject to meaning-preservation and topicality:

$$P^{*} = \arg\max_{P \in \mathcal{M},\; \neg\mathrm{OffTopic}(P, G)} \; \mathbb{E}_{R \sim T(P)}\big[\mathrm{Judge}(R, G)\big]$$

where $\mathcal{M}$ denotes meaningful prompts and $\mathrm{Judge}(R, G)$ measures whether the target output $R$ constitutes a jailbreak for goal $G$ (Mehrotra et al., 2023).
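Because the attacker's output is structured, a search implementation has to validate it before using it. The sketch below shows one way to do that, assuming the JSON object carries an `improvement` field and a `prompt` field; the function name `parse_attacker_output` is hypothetical, and the field name `prompt` is an assumption about the schema.

```python
import json
from typing import Optional, Tuple


def parse_attacker_output(raw: str) -> Optional[Tuple[str, str]]:
    """Return (improvement, prompt), or None if the output is malformed.

    Assumed schema: {"improvement": <str>, "prompt": <non-empty str>}.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    improvement = obj.get("improvement")
    prompt = obj.get("prompt")
    if not isinstance(improvement, str) or not isinstance(prompt, str):
        return None
    if not prompt.strip():
        return None  # an empty refinement is not a usable candidate
    return improvement, prompt
```

Returning `None` instead of raising lets the caller simply drop malformed candidates, which is how a format-validation pruning step would typically behave.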

In multi-turn scenarios such as DialTree-RPO, each node encompasses a full dialogue history. The attacker policy $\pi_\theta$ samples $n$ dialogue continuations per context, enabling exploration of complex strategies over multiple turns (Guo et al., 2 Oct 2025).
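A dialogue-tree node can be pictured as an immutable record of the turns so far. The sketch below is an illustration only and does not reproduce DialTree-RPO: `DialogueNode`, `expand`, and the `policy` and `target` callables are hypothetical names, and the number of continuations `n` is a free parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (attacker utterance, target reply)


@dataclass(frozen=True)
class DialogueNode:
    """A tree node holding the full dialogue history up to this point."""
    history: Tuple[Turn, ...] = ()

    def extend(self, utterance: str, reply: str) -> "DialogueNode":
        # Children share the parent's history and append one new turn.
        return DialogueNode(self.history + ((utterance, reply),))


def expand(
    node: DialogueNode,
    policy: Callable[[Tuple[Turn, ...], int], List[str]],  # samples n utterances
    target: Callable[[Tuple[Turn, ...], str], str],        # replies in context
    n: int,
) -> List[DialogueNode]:
    """Create one child per sampled attacker continuation."""
    children = []
    for utterance in policy(node.history, n):
        reply = target(node.history, utterance)
        children.append(node.extend(utterance, reply))
    return children
```

Keeping nodes immutable means siblings never share mutable state, so pruning a branch cannot corrupt the history of another.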

3. Pruning Strategies

TAP deploys two primary pruning mechanisms to manage the combinatorial growth of the search tree and focus exploration:

  • Phase I: Off-topic Pruning. $\mathrm{OffTopic}(P, G)$ is a binary predicate implemented by the evaluator LLM $E$ using an explicit prompt (“Does $P$ request the same information as $G$? YES/NO.”). Any $P$ flagged as off-topic is discarded prior to querying the target.
  • Phase II: Width Control / Top-$w$ Pruning. After scoring candidate prompts or partial dialogues, if more than $w$ survivors remain, only the $w$ with the largest Judge scores are retained.

No continuous scoring threshold is used beyond off-topic filtering and score-based ranking. In multi-turn or RL-based variants (e.g., DialTree-RPO), pruning also includes format validation and, optionally, NLI topic entailment checks and stochastic subsampling to maintain bounded width at each tree level (Guo et al., 2 Oct 2025).
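The two pruning phases reduce to a filter and a top-$w$ selection. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the evaluator's reply is assumed to begin with YES or NO, and a NO reply (the prompt does not request the same information as the goal) is read as off-topic.

```python
from typing import Callable, List, Tuple


def off_topic_from_answer(answer: str) -> bool:
    """Interpret the evaluator's YES/NO reply; NO means off-topic."""
    return answer.strip().upper().startswith("NO")


def prune_off_topic(
    goal: str,
    prompts: List[str],
    is_off_topic: Callable[[str, str], bool],  # (goal, prompt) -> off-topic?
) -> List[str]:
    """Phase I: discard off-topic prompts before any target query."""
    return [p for p in prompts if not is_off_topic(goal, p)]


def prune_to_width(
    scored: List[Tuple[str, float]],  # (prompt, judge score) pairs
    width: int,
) -> List[Tuple[str, float]]:
    """Phase II: keep the `width` highest-scoring candidates.

    Pure ranking, no score threshold; ties keep their original order
    because Python's sort is stable.
    """
    if len(scored) <= width:
        return list(scored)
    return sorted(scored, key=lambda item: item[1], reverse=True)[:width]
```

For example, `prune_to_width([("a", 3), ("b", 9), ("c", 5)], 2)` keeps `("b", 9)` and `("c", 5)`.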

4. Query Efficiency and Theoretical Bounds

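With the width capped at $w$ leaves per level, and without off-topic pruning or early termination, each of the $d$ levels issues at most $b \cdot w$ target queries, so the total number of black-box queries incurred by TAP is bounded by:

$$\text{queries} \le b \cdot w \cdot d$$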

Empirically, aggressive off-topic pruning (∼50% per layer) and early stopping (on jailbreak success) reduce query requirements significantly relative to prior work (Mehrotra et al., 2023). On the AdvBench Subset and GPT-4 target, TAP required an average of ∼28.8 queries per jailbreak, improving over the sequential PAIR baseline (∼39.6 queries), while achieving substantially higher jailbreak success rates (∼90% on GPT-4 with TAP vs. ∼60% for PAIR). When extended to multi-turn dialogue, DialTree-RPO achieved even higher attack success rates with fewer queries on most model targets (Guo et al., 2 Oct 2025).
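The effect of width control and off-topic pruning on the query budget can be checked with simple level-by-level arithmetic. The sketch below is an idealized model, not a measurement: it assumes the tree starts from a single root, that exactly a fixed fraction of candidates survives off-topic pruning at every level, that no run terminates early, and it uses illustrative values $b = 4$, $w = 10$, $d = 10$. The function name `count_queries` is hypothetical.

```python
def count_queries(b: int, w: int, d: int, keep_fraction: float = 1.0) -> int:
    """Count target queries level by level in an idealized TAP run.

    keep_fraction is the assumed share of candidates that survive
    off-topic pruning at each level (1.0 means no off-topic pruning).
    """
    leaves = 1  # single root
    total = 0
    for _ in range(d):
        candidates = leaves * b                       # branch expansion
        survivors = int(candidates * keep_fraction)   # off-topic pruning
        total += survivors                            # one query per survivor
        leaves = min(survivors, w)                    # width control
        if leaves == 0:
            break
    return total


no_pruning = count_queries(b=4, w=10, d=10)                      # 340
half_pruned = count_queries(b=4, w=10, d=10, keep_fraction=0.5)  # 150
```

Under these assumptions the run without off-topic pruning issues 340 queries, below the $b \cdot w \cdot d = 400$ ceiling because the first two levels have fewer than $w$ leaves, and discarding half of the candidates per level cuts this to 150. Early termination on success is what would bring real averages down further.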

5. Empirical Results and Comparative Evaluation

TAP was evaluated on standardized adversarial benchmarks, including the AdvBench Subset (50 goals, 32 categories) and held-out sets. Target models spanned open-source (Vicuna-v1.5, Llama-7B), closed-source (GPT-3.5, GPT-4, GPT-4-Turbo, PaLM-2, Gemini-Pro), and protected variants wrapped with LlamaGuard.

| Method | GPT-4 (ASR) | GPT-4-Turbo (ASR) | Queries (GPT-4) |
|---|---|---|---|
| TAP | 90% | 84% | 28.8 |
| PAIR | 60% | 44% | 39.6 |
| GCG (white-box) | | | |
| DialTree-RPO | 85.3%* | | ∼3* |

* DialTree-RPO achieves 85.3% ASR on average across 10 models using ∼3 queries per attack, outperforming TAP’s 42.6% average success. GCG requires hundreds of thousands of queries (open-source only) (Mehrotra et al., 2023, Guo et al., 2 Oct 2025).

TAP surpasses prior black-box methods in both efficiency and efficacy, and remains effective against targets protected by state-of-the-art guardrails such as LlamaGuard.

6. Extensions, Limitations, and Prospects

Limitations

  • Evaluator LLM Dependence: TAP’s pruning and success evaluation are bottlenecked by the strength of the evaluator LLM. Substituting weaker LLMs (e.g., GPT-3.5) or heuristics leads to pronounced performance drops.
  • Dataset Generalizability: TAP’s empirical gains are validated on established harm benchmarks; transferability to unseen or orthogonal goal types (privacy, bias) is unproven.
  • Black-box Constraints: Only a bounded prefix of the target output (a fixed number of leading tokens) is observed, limiting visibility into streaming or filtered output modes.
  • Static Attacker Policy: TAP’s attacker is not adapted online; learning-based or fine-tuned attackers could improve exploration.

Extensions and Future Directions

  • Specialized Evaluators: Fine-tuning small LLMs for harm-specific evaluation could replace the current reliance on large proprietary evaluators.
  • Multi-Prompt and Dialogue Attacks: Extending TAP to sequences of adaptive prompts or multi-turn dialogues has demonstrated further gains (as in DialTree-RPO) (Guo et al., 2 Oct 2025).
  • Adversarial Red-Teaming: TAP-generated adversarial prompts offer valuable data for proactive defense and continual robustification of LLM guardrails.
  • Alternative Branching/Selection Mechanisms: Learning- or UCB-based subtree selection methods could potentially further improve search efficiency.

7. Significance and Synthesis

TAP unifies breadth-first tree search, chain-of-thought guided prompt evolution, and targeted pruning into an automated black-box jailbreak discovery framework. By balancing exploration and efficiency, TAP identifies diverse, interpretable prompts that defeat robust LLM safety mechanisms at high rates with modest query budgets. Multi-turn extensions such as DialTree-RPO highlight the continued vulnerability of LLMs to sophisticated adversarial prompting and motivate further work on defense-oriented evaluation and mitigation strategies (Mehrotra et al., 2023, Guo et al., 2 Oct 2025).
