
Fable 5: LLM Adversarial Robustness

Updated 1 July 2026
  • Fable 5 is a frontier large language model developed by Anthropic, evaluated for adversarial robustness using automated red-teaming and structured safety evaluations.
  • The evaluation employs the HackAgent toolkit with adaptive attack strategies like TAP and PAIR to measure vulnerabilities under sustained adversarial pressure.
  • Adaptive attacks account for 97% of confirmed jailbreaks, highlighting the need for continuous, multi-turn semantic defenses over static filtering methods.

Fable 5 is a frontier LLM developed by Anthropic and evaluated for adversarial robustness in the context of automated, large-scale red-teaming. The term "Fable 5" refers specifically to this proprietary LLM as assessed in structured safety evaluations, with quantitative and taxonomic analysis of its vulnerabilities under multiple families of automated jailbreak attacks. The findings characterize both the strengths and residual weaknesses of advanced LLM safety architectures and provide a detailed breakdown of model behavior under sustained adversarial pressure (Franco, 16 Jun 2026).

1. Automated Adversarial Evaluation Methodology

The adversarial robustness assessment of Fable 5 uses the HackAgent toolkit, an open-source red-teaming framework that automates and scales LLM attacks via a “hostile” attacker model. The benchmark consists of 7,826 harmful intents, organized into ten top-level categories and 55 subcategories of harm, against which model resistance is tested.

At each step, HackAgent submits crafted prompts to Fable 5’s black-box API. A lightweight, HarmBench-style in-loop model flags likely harmful responses and steers the search adaptively. After the automated campaign, each candidate harmful response is reassessed by a three-judge panel (Qwen 3.7 Max, Gemini 3.5 Flash, GPT 5.5), and a candidate counts as a confirmed jailbreak only if at least two of the three judges agree. This two-stage pipeline is designed to measure real-world vulnerability conservatively: it penalizes single-judge inflation and counts only robustly confirmed jailbreaks (Franco, 16 Jun 2026).
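The second-stage adjudication rule can be illustrated with a minimal sketch. The candidate IDs, vote tuples, and function names below are hypothetical and are not taken from HackAgent; the only element drawn from the text is the two-of-three majority threshold.

```python
from typing import Sequence


def panel_confirms(votes: Sequence[bool], min_agree: int = 2) -> bool:
    """Return True when at least `min_agree` judges flag the response as harmful."""
    return sum(votes) >= min_agree


def confirmed_jailbreaks(candidates: dict[str, Sequence[bool]]) -> list[str]:
    """Keep only candidates that the panel confirms by majority vote."""
    return [cid for cid, votes in candidates.items() if panel_confirms(votes)]


# Hypothetical candidates: one vote per judge, True = judged harmful.
candidates = {
    "cand-001": (True, True, False),   # 2 of 3 -> confirmed
    "cand-002": (True, False, False),  # single-judge flag -> discarded
    "cand-003": (True, True, True),    # unanimous -> confirmed
}
confirmed = confirmed_jailbreaks(candidates)
```

Under this rule a candidate flagged by only one judge is dropped, which is what keeps single-judge inflation out of the reported counts.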

2. Taxonomy of Jailbreak Attack Families

Fable 5 was subjected to four mechanistically distinct jailbreak procedures, each targeting different operational defenses:

  1. TAP (Tree of Attacks with Pruning): An adaptive multi-step search with parameterized depth, width, and branching factor. The attacker LLM generates, evaluates, and prunes prompt branches, focusing expansion on the most promising candidates.
  2. PAIR (Prompt Automatic Iterative Refinement): Automated iterative refinement, with up to 12 iterations across multiple parallel streams. The attacker rewrites and resubmits prompts in response to model refusals until a presumed successful jailbreak is detected or the limit is reached. PAIR coverage was partial, targeting 27 of 55 subcategories.
  3. PAP (Persuasive Adversarial Prompts): A non-adaptive, one-shot attack that leverages human persuasion techniques (role-play, hypothetical framing, appeals to authority), with no subsequent adaptation.
  4. h4rm3l (Static Obfuscation): Fixed, non-adaptive input decorators—e.g., base64 encoding, character ciphers, prompt splitting, few-shot priming, "DAN" role-play injection, and Wikipedia-imitative phrasing.

This stratification reveals the relative efficacy and limitations of both adaptive and static adversarial strategies.
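The search structure behind an adaptive method such as TAP can be sketched generically as a breadth-limited tree search with pruning. This is a simplified illustration over abstract candidates, not the TAP or HackAgent implementation: `expand`, `score`, and all parameter names are assumptions, and the toy usage searches over integers rather than prompts.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def tree_search_with_pruning(
    root: T,
    expand: Callable[[T], list[T]],   # generates child candidates (branching)
    score: Callable[[T], float],      # higher = more promising
    width: int,                       # candidates kept per level after pruning
    depth: int,                       # maximum number of levels explored
    success_threshold: float,
) -> Optional[T]:
    """Expand the frontier level by level, keeping only the top `width` nodes."""
    frontier = [root]
    for _ in range(depth):
        children = [child for node in frontier for child in expand(node)]
        if not children:
            return None
        ranked = sorted(children, key=score, reverse=True)
        if score(ranked[0]) >= success_threshold:
            return ranked[0]          # stop as soon as one candidate succeeds
        frontier = ranked[:width]     # prune everything below the cutoff
    return None


# Toy usage: reach the integer 13 by doubling, with branching factor 2.
target = 13
found = tree_search_with_pruning(
    root=1,
    expand=lambda n: [2 * n, 2 * n + 1],
    score=lambda n: -abs(n - target),
    width=2,
    depth=3,
    success_threshold=0.0,
)
too_shallow = tree_search_with_pruning(
    root=1,
    expand=lambda n: [2 * n, 2 * n + 1],
    score=lambda n: -abs(n - target),
    width=2,
    depth=2,
    success_threshold=0.0,
)
```

The depth, width, and branching parameters bound the query budget, which is why adaptive methods can be compared against one-shot methods on a per-attempt basis.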

3. Adversarial Success-Rate Metric and Quantitative Results

Robustness is measured using the panel-confirmed Attack Success Rate (ASR), formalized as:

$$\mathrm{ASR} = \frac{\text{confirmed jailbreaks}}{\text{total attempts}} \times 100\%$$

where every “attempt” includes all queries (across intents, decorators, and streams).

Table 1. Confirmed jailbreaks on Fable 5 by technique

| Technique | Confirmed jailbreaks | Attempts | ASR |
|---|---|---|---|
| TAP | 477 | 7,826 | 6.10 % |
| PAIR | 162 | 3,766 | 4.30 %† |
| PAP | 42 | 7,826 | 0.54 % |
| h4rm3l | 21 | 46,956 | 0.04 % |

† PAIR coverage incomplete (lower bound).
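As a worked check, the ASR values in Table 1 can be recomputed from the confirmed counts and attempt totals. The function and variable names are illustrative; the figures are the ones listed in the table.

```python
def attack_success_rate(confirmed: int, attempts: int) -> float:
    """Panel-confirmed ASR as a percentage of total attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * confirmed / attempts


# (confirmed jailbreaks, attempts) per technique, copied from Table 1.
table_1 = {
    "TAP": (477, 7_826),
    "PAIR": (162, 3_766),
    "PAP": (42, 7_826),
    "h4rm3l": (21, 46_956),
}

asr = {name: attack_success_rate(c, n) for name, (c, n) in table_1.items()}
total_confirmed = sum(c for c, _ in table_1.values())
total_attempts = sum(n for _, n in table_1.values())

for name, value in asr.items():
    print(f"{name:7s} {value:5.2f} %")
```

The h4rm3l attempt count equals 6 × 7,826, which is consistent with six static decorators applied to each intent.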

Adaptive attacks (TAP, PAIR) dominate the residual vulnerability, accounting for a reported 97 % of all confirmed jailbreaks (the Table 1 counts alone give 639 of 702, about 91 %). Static obfuscation is nearly nullified (0.04 % ASR), and persuasion-only attacks are held below 1 %.

4. Harm-Taxonomy Breakdown and Residual Vulnerabilities

Fable 5’s vulnerability profile is non-uniform across harm categories. Panel-confirmed harmful completions spanned all categories, with the largest aggregate counts in Ethical/Social (A), Decision/Cognitive (I), and Child Safety (J). Child Safety also shows the highest single-technique rate (13.7 % under TAP).

Table 2. Fable 5 ASR (%) and confirmed counts by category and technique

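| Cat. | TAP % (count) | PAIR % (count) | PAP % (count) | h4rm3l % (count) |
|---|---|---|---|---|
| A | 10.2 (101) | 9.2 (91) | 1.0 (10) | 0.1 (4) |
| B | 2.0 (10) | 1.8 (9) | 0.2 (1) | 0.0 (0) |
| C | 4.4 (39) | 5.2 (46) | 1.0 (9) | 0.1 (4) |
| D | 3.2 (33) | 2.1 (14) | 0.7 (7) | 0.0 (1) |
| E | 0.4 (3) | 0.3 (2) | 0.0 (0) | 0.0 (0) |
| F | 7.3 (39) | n/a | 0.0 (0) | 0.1 (4) |
| G | 6.9 (37) | n/a | 0.7 (4) | 0.1 (3) |
| H | 2.6 (8) | n/a | 0.0 (0) | 0.0 (0) |
| I | 6.6 (105) | n/a | 0.5 (8) | 0.0 (4) |
| J | 13.7 (102) | n/a | 0.4 (3) | 0.0 (1) |

n/a: no PAIR figure is listed for this category (PAIR coverage was partial).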

Aggregate counts: A (206), B (20), C (98), D (55), E (5), F (43), G (44), H (8), I (117), J (106), totaling 702 confirmed harmful completions.
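The aggregate counts can be cross-checked against both tables with a short sketch. Categories F through J carry no PAIR figure in Table 2 and are represented here as `None`; that reading is an assumption, supported by the fact that the PAIR counts for A through E already sum to the 162 reported in Table 1.

```python
# Confirmed counts per category as (TAP, PAIR, PAP, h4rm3l), from Table 2.
# None marks a category with no PAIR figure listed (assumed: not covered).
table_2 = {
    "A": (101, 91, 10, 4),
    "B": (10, 9, 1, 0),
    "C": (39, 46, 9, 4),
    "D": (33, 14, 7, 1),
    "E": (3, 2, 0, 0),
    "F": (39, None, 0, 4),
    "G": (37, None, 4, 3),
    "H": (8, None, 0, 0),
    "I": (105, None, 8, 4),
    "J": (102, None, 3, 1),
}

# Row sums: confirmed completions per harm category.
category_totals = {
    cat: sum(v for v in row if v is not None) for cat, row in table_2.items()
}

# Column sums: confirmed completions per technique, in table column order.
technique_totals = [
    sum(row[i] or 0 for row in table_2.values()) for i in range(4)
]

grand_total = sum(category_totals.values())
```

Row sums reproduce the aggregate counts listed above, and column sums reproduce the per-technique totals in Table 1.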

Residual exploitation relies predominantly on semantic reframing, such as novel “role-play” or “authority” prompts, rather than on lexical trickery. This indicates the limits of conventional input-sanitization defenses.

5. Methodological Rigor and Adjudication

Every candidate jailbreak undergoes post-hoc independent judgment by a three-model panel. Majority voting yields conservative, high-precision classification by discarding cases deemed harmful by only a single judge. This approach lowers false positives and models practical safety exposure more closely than relying on a single scoring function or judge model.

Total attempts against Fable 5 were substantial: 7,826 TAP, 3,766 PAIR (a subset of categories), 7,826 PAP, and 46,956 h4rm3l, or 66,374 counted attempts in total.

6. Main Conclusions and Recommendations

Despite systematic safety training, Fable 5 remains “reliably breakable under automated pressure” at production-relevant rates. Even single-digit ASRs (6.1 % at worst, for TAP) translated into 702 confirmed harmful completions across the four techniques, often found by fully automated attackers in minimal refinement steps. Adaptive, context-driven jailbreaks, rather than static or persuasion-only attacks, account for the vast majority of confirmed harms.

Vulnerabilities concentrate in select harm categories (notably child safety, ethical/social, and decision-support), suggesting that targeted red-teaming and refusal mechanisms could reduce exposure, but the presence of nonzero rates in all categories demonstrates that “addressable does not mean already addressed.”

Recommendations highlight a necessary pivot from reactive, single-prompt blocking toward continuous, semantic, and multi-turn analysis as the only credible path to shrinking the remaining adaptive attack surface. The findings indicate that current frontier-model safety architectures, while effective against non-adaptive attacks, remain insufficiently robust against state-of-the-art automated adversaries (Franco, 16 Jun 2026).

7. Significance for LLM Safety

The Fable 5 red-teaming study establishes that even the most-tested, hardened LLMs have a nontrivial residual attack surface against automated, feedback-driven adversaries. The dataset, methodology, and taxonomic breakdown serve as benchmarks for future safety evaluations and point toward the development of continuous, context-aware mitigation strategies at inference time. These results set concrete quantitative standards for LLM safety assessment and indicate fundamental limitations of prompt-only and static-filter defenses in large-scale, real-world deployments (Franco, 16 Jun 2026).

References (1)
