Tree of Attacks with Pruning (TAP)
- Tree of Attacks with Pruning (TAP) is an automated adversarial prompting framework that uses breadth-first tree search and chain-of-thought refinements to generate effective jailbreak prompts for LLMs.
- It systematically expands candidate prompts and prunes off-topic or low-scoring ones using evaluator LLMs, significantly reducing query requirements while boosting success rates.
- Empirical results demonstrate TAP’s high efficiency, achieving up to 90% jailbreak success on models like GPT-4 with notably fewer queries compared to previous methods.
Tree of Attacks with Pruning (TAP) is an automated adversarial prompting framework designed to jailbreak LLMs using only black-box access. TAP explores the prompt space with a breadth-first tree search in which an attacker LLM proposes chain-of-thought refinements of candidate prompts, and it prunes unpromising candidates to reduce the number of queries and raise attack success rates. TAP and its derivatives have established new empirical state-of-the-art results for automated black-box jailbreaks, and they provide a foundation for efficient red-teaming and adversarial evaluation of LLM safety mechanisms (Mehrotra et al., 2023).
1. Algorithmic Structure of the TAP Framework
TAP instantiates a breadth-first “tree-of-thought” attack search, where each node in the search tree corresponds to a candidate attack prompt or partial dialogue, and the edges correspond to prompt refinements or dialogue continuations supplied by an attacker LLM. For single-turn jailbreaks, TAP proceeds as follows:
- Branch Expansion: At each tree depth (up to a maximum depth $d$), each current leaf prompt is expanded into $b$ new prompts by an attacker LLM $A$, each representing a single-step chain-of-thought refinement.
- Prune I (Off-topic Pruning): Candidate prompts are discarded if an evaluator LLM $E$ deems them off-topic relative to the original goal $G$.
- Query and Assessment: Surviving candidates are sent to the target LLM $T$; its responses are collected and scored by a Judge function (also implemented by $E$).
- Termination and Prune II (Width Control): If any score indicates a successful jailbreak, the process halts with success. Otherwise, if more than $w$ candidate leaves remain, only the $w$ highest-scoring leaves are retained.
The process repeats until a jailbreak is achieved or the maximum tree depth is reached. TAP's key parameters are the tree depth $d$, the branching factor $b$, and the maximum width $w$; the principal experiments used $d = 10$, $b = 4$, and $w = 10$ (Mehrotra et al., 2023).
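The single-turn loop described above can be sketched as a generic width-limited tree search. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `tap_search` and `Node`, the callable signatures, seeding the tree with the goal string, and the `success_score` threshold are all hypothetical, and the attacker, evaluator, and target are replaced here by toy string functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    prompt: str
    response: str = ""
    score: int = 0


def tap_search(
    goal: str,
    refine: Callable[[str, Node], List[str]],  # attacker: refinements of one leaf
    on_topic: Callable[[str, str], bool],      # evaluator: is the prompt still on goal?
    target: Callable[[str], str],              # target model (black box)
    judge: Callable[[str, str, str], int],     # evaluator: score (goal, prompt, response)
    depth: int,
    width: int,
    success_score: int,                        # assumed success threshold
) -> Optional[Node]:
    leaves = [Node(prompt=goal)]               # assumption: seed the tree with the goal
    for _ in range(depth):
        # Branch expansion: every leaf proposes its refinements.
        candidates = [p for leaf in leaves for p in refine(goal, leaf)]
        # Prune I: drop off-topic candidates before spending target queries.
        candidates = [p for p in candidates if on_topic(goal, p)]
        # Query and assessment, with early termination on success.
        scored: List[Node] = []
        for prompt in candidates:
            response = target(prompt)
            node = Node(prompt, response, judge(goal, prompt, response))
            if node.score >= success_score:
                return node
            scored.append(node)
        # Prune II: keep only the `width` highest-scoring leaves.
        scored.sort(key=lambda n: n.score, reverse=True)
        leaves = scored[:width]
        if not leaves:
            return None
    return None


# Toy stand-ins: strings grow by one letter per level, and the "score" counts letter a.
queries: List[str] = []


def toy_target(prompt: str) -> str:
    queries.append(prompt)
    return prompt


result = tap_search(
    goal="x",
    refine=lambda goal, leaf: [leaf.prompt + "a", leaf.prompt + "b", "off"],
    on_topic=lambda goal, prompt: prompt.startswith(goal),
    target=toy_target,
    judge=lambda goal, prompt, response: response.count("a"),
    depth=5,
    width=2,
    success_score=3,
)
print(result, len(queries))
```

In the toy run the off-topic candidate is never sent to the target, and the search stops at the first candidate that reaches the threshold, which is how the two pruning phases and early termination keep the query count low.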
For multi-turn jailbreaks and complex dialogue settings, as in DialTree-RPO, TAP generalizes to dialogue trees: nodes correspond to interleaved attacker-target utterances, and branching, evaluation, and pruning proceed at each turn (Guo et al., 2 Oct 2025).
2. Candidate-Prompt and Dialogue Refinement
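Prompt refinement within TAP is attacker-driven and informed by chain-of-thought analysis. For each node, the attacker LLM $A$ processes the conversation history and outputs a JSON object of the form:
`{"improvement": "<diagnosis>", "prompt": "<refined prompt>"}`
The improvement field is a natural-language diagnosis of why prior prompts failed, while the prompt field holds an evolved prompt crafted to evade safety filters. Candidate prompts seek to maximize expected jailbreak success, subject to meaning-preservation and topicality:
$$\max_{P \in \mathcal{M}} \; \mathbb{E}_{R \sim T(P)}\big[\mathrm{Judge}(R, G)\big] \quad \text{subject to} \quad \mathrm{OffTopic}(P, G) = 0,$$
where $\mathcal{M}$ denotes meaningful prompts, $R \sim T(P)$ is the target's response to prompt $P$, $G$ is the goal, and $\mathrm{Judge}$ measures whether the target output constitutes a jailbreak (Mehrotra et al., 2023).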
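Because the attacker's output is free-form model text, a consumer of it typically has to validate the JSON before using it. The sketch below shows one way to do that with the standard `json` module; the `Refinement` type and `parse_refinement` helper are hypothetical names, and returning `None` for malformed output (so the candidate can simply be dropped) is an assumption rather than documented TAP behavior.

```python
import json
from typing import NamedTuple, Optional


class Refinement(NamedTuple):
    improvement: str  # diagnosis of why earlier prompts failed
    prompt: str       # the refined candidate prompt


def parse_refinement(raw: str) -> Optional[Refinement]:
    """Return the two expected fields, or None if the output is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    improvement = data.get("improvement")
    prompt = data.get("prompt")
    if not isinstance(improvement, str) or not isinstance(prompt, str):
        return None
    return Refinement(improvement, prompt)


ok = parse_refinement('{"improvement": "too direct", "prompt": "rephrased request"}')
bad = parse_refinement("not json at all")
incomplete = parse_refinement('{"improvement": "missing the other field"}')
print(ok, bad, incomplete)
```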
In multi-turn scenarios such as DialTree-RPO, each node encompasses a full dialogue history. The attacker policy samples several candidate dialogue continuations per context, enabling exploration of complex strategies over multiple turns (Guo et al., 2 Oct 2025).
3. Pruning Strategies
TAP deploys two primary pruning mechanisms to manage the combinatorial growth of the search tree and focus exploration:
- Phase I: Off-topic Pruning. $\mathrm{OffTopic}(P, G)$ is a binary predicate implemented by the evaluator LLM $E$ using an explicit prompt (“Does $P$ request the same information as $G$? YES/NO.”). Any $P$ flagged as off-topic is discarded prior to querying the target.
- Phase II: Width Control / Top-$w$ Pruning. After scoring candidate prompts or partial dialogues, if more than $w$ survivors remain, only the $w$ with the largest Judge scores are retained.
No continuous scoring threshold is used beyond off-topic filtering and score-based ranking. In multi-turn or RL-based variants (e.g., DialTree-RPO), pruning also includes format validation and, optionally, NLI topic entailment checks and stochastic subsampling to maintain bounded width at each tree level (Guo et al., 2 Oct 2025).
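The two pruning phases reduce to a text-to-boolean conversion and a top-$w$ selection. The following sketch illustrates both in isolation; `is_off_topic` and `keep_top_w` are hypothetical helper names, and treating any evaluator reply that does not start with "YES" as off-topic is a conservative assumption made for this example.

```python
import heapq
from typing import List, Tuple


def is_off_topic(evaluator_reply: str) -> bool:
    """Interpret the evaluator's YES/NO answer to the same-information question.

    Assumption: anything other than a reply starting with YES counts as off-topic.
    """
    return not evaluator_reply.strip().upper().startswith("YES")


def keep_top_w(scored: List[Tuple[int, str]], w: int) -> List[Tuple[int, str]]:
    """Keep the w (score, prompt) pairs with the largest scores, best first."""
    return heapq.nlargest(w, scored, key=lambda item: item[0])


replies = ["YES", " no.", "Yes, it does", "unclear"]
flags = [is_off_topic(r) for r in replies]
survivors = keep_top_w([(3, "p1"), (9, "p2"), (5, "p3"), (1, "p4")], w=2)
print(flags, survivors)
```

Only ranking is involved in the second phase: no candidate is dropped for having a low score unless more than `w` candidates survive, which matches the absence of a continuous scoring threshold noted above.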
4. Query Efficiency and Theoretical Bounds
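With width control in place but without off-topic pruning or early termination, each of the $d$ levels expands at most $w$ leaves into $b$ candidates each, so the total number of black-box queries to the target is bounded by:
$$N_{\text{queries}} \le b \cdot w \cdot d.$$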
Empirically, aggressive off-topic pruning (∼50% per layer) and early stopping (on jailbreak success) reduce query requirements significantly relative to prior work (Mehrotra et al., 2023). On the AdvBench Subset with GPT-4 as the target, TAP required an average of 28.8 queries per jailbreak, fewer than the sequential PAIR baseline (∼39.6 queries), while achieving a substantially higher jailbreak success rate (∼90% with TAP vs. ∼60% for PAIR). When extended to multi-turn dialogue, DialTree-RPO achieved even higher attack success rates with fewer queries on most model targets (Guo et al., 2 Oct 2025).
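To make the gap between the worst case and the observed query counts concrete, the level-by-level count can be computed directly. The sketch below is a back-of-the-envelope model, not a result from the paper: `count_queries` is a hypothetical helper, the configuration $b = 4$, $w = 10$, $d = 10$ is used as an illustrative setting, and the 50% off-topic rate quoted above is treated as a fixed fraction applied at every level.

```python
def count_queries(b: int, w: int, d: int, keep: float = 1.0) -> int:
    """Target queries over d levels when no jailbreak is ever found.

    b: children per leaf, w: max leaves kept per level,
    keep: fraction of candidates surviving off-topic pruning (assumed constant).
    """
    leaves, total = 1, 0
    for _ in range(d):
        queried = int(leaves * b * keep)  # candidates actually sent to the target
        total += queried
        leaves = min(queried, w)          # width control caps the next level
    return total


loose_bound = 4 * 10 * 10                       # b * w * d
exact_no_pruning = count_queries(4, 10, 10)     # tree starts from a single root
with_half_pruned = count_queries(4, 10, 10, keep=0.5)
print(loose_bound, exact_no_pruning, with_half_pruned)  # 400 340 150
```

Under these assumptions, off-topic pruning alone more than halves the worst case; the remaining distance to the reported average of 28.8 queries on GPT-4 would come from early termination once a jailbreak is found.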
5. Empirical Results and Comparative Evaluation
TAP was evaluated on standardized adversarial benchmarks, including the AdvBench Subset (50 goals, 32 categories) and held-out sets. Target models spanned open-source (Vicuna-v1.5, Llama-2-Chat-7B), closed-source (GPT-3.5, GPT-4, GPT-4-Turbo, PaLM-2, Gemini-Pro), and protected variants wrapped with LlamaGuard.
| Method | GPT-4 ASR | GPT-4-Turbo ASR | Avg. queries (GPT-4) |
|---|---|---|---|
| TAP | 90% | 84% | 28.8 |
| PAIR | 60% | 44% | 39.6 |
| GCG (white-box) | – | – | – |
| DialTree-RPO | 85.3%* | – | ∼3* |
* DialTree-RPO figures are averages across 10 target models rather than GPT-4 alone: 85.3% ASR using ∼3 queries per attack, compared with an average ASR of 42.6% for TAP. GCG is a white-box method that applies only to open-source models and requires hundreds of thousands of queries (Mehrotra et al., 2023, Guo et al., 2 Oct 2025).
TAP surpasses prior black-box methods in both efficiency and efficacy, and remains effective against targets protected by state-of-the-art guardrails such as LlamaGuard.
6. Extensions, Limitations, and Prospects
Limitations
- Evaluator LLM Dependence: TAP’s pruning and success evaluation are bottlenecked by the strength of the evaluator LLM. Substituting weaker LLMs (e.g., GPT-3.5) or heuristics leads to pronounced performance drops.
- Dataset Generalizability: TAP’s empirical gains are validated on established harm benchmarks; transferability to unseen or orthogonal goal types (privacy, bias) is unproven.
- Black-box Constraints: Only a bounded prefix of each target output is observed, limiting visibility into streaming or filtered output modes.
- Static Attacker Policy: TAP’s attacker is not adapted online; learning-based or fine-tuned attackers could improve exploration.
Extensions and Future Directions
- Specialized Evaluators: Fine-tuning small LLMs for harm-specific evaluation could replace the current reliance on large proprietary evaluators.
- Multi-Prompt and Dialogue Attacks: Extending TAP to sequences of adaptive prompts or multi-turn dialogues has demonstrated further gains (as in DialTree-RPO) (Guo et al., 2 Oct 2025).
- Adversarial Red-Teaming: TAP-generated adversarial prompts offer valuable data for proactive defense and continual robustification of LLM guardrails.
- Alternative Branching/Selection Mechanisms: Learning- or UCB-based subtree selection methods could potentially further improve search efficiency.
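As an illustration of the last item, a UCB-style rule would choose which leaf to expand by trading off its average score against how rarely it has been tried. The sketch below uses the standard UCB1 formula; it is not part of TAP, and the names `ucb1` and `select_leaf`, the per-leaf statistics, and the normalization of Judge scores to $[0, 1]$ are all assumptions made for the example.

```python
import math
from typing import List, Tuple


def ucb1(mean_score: float, visits: int, total_visits: int,
         c: float = math.sqrt(2)) -> float:
    """UCB1 index: mean reward plus an exploration bonus; unvisited leaves come first."""
    if visits == 0:
        return math.inf
    return mean_score + c * math.sqrt(math.log(total_visits) / visits)


def select_leaf(stats: List[Tuple[str, float, int]]) -> str:
    """stats holds (leaf_id, mean score in [0, 1], visit count); returns the id to expand."""
    total = sum(visits for _, _, visits in stats)
    return max(stats, key=lambda s: ucb1(s[1], s[2], total))[0]


# A rarely tried leaf can outrank a well-explored one with a higher mean score.
choice = select_leaf([("a", 0.8, 10), ("b", 0.5, 1)])
untried_first = select_leaf([("a", 0.8, 10), ("b", 0.5, 1), ("c", 0.0, 0)])
print(choice, untried_first)
```

Compared with top-$w$ pruning, which ranks leaves by score alone, such a rule would keep revisiting under-explored branches, at the cost of tracking visit counts per subtree.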
7. Significance and Synthesis
TAP unifies breadth-first tree search, chain-of-thought guided prompt evolution, and targeted pruning into an automated black-box jailbreak discovery framework. By balancing exploration and efficiency, TAP identifies diverse, interpretable prompts that defeat robust LLM safety mechanisms at high rates with modest query budgets. Multi-turn extensions such as DialTree-RPO highlight the continued vulnerability of LLMs to sophisticated adversarial prompting and motivate further work on defense-oriented evaluation and mitigation strategies (Mehrotra et al., 2023, Guo et al., 2 Oct 2025).