Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models (2511.15304v2)
Abstract: We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for LLMs. Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of 3 open-weight LLM judges, whose binary safety assessments were validated on a stratified human-labeled subset. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
Explain it Like I'm 14
Overview
This paper studies a surprising weakness in many LLMs, like the AI chatbots you use online. The authors found that if you turn a harmful request into a poem, many LLMs are much more likely to answer it—even when they should refuse. They call this “adversarial poetry,” and they show it can work in a single message (a “single turn”) across lots of different models and companies.
Key Questions
The research asks three simple questions:
- Does rewriting a harmful request as a poem make AI models more likely to answer when they should say no?
- Does this trick work on many different models, not just a few?
- Does it work across lots of dangerous areas (like cyberattacks, chemical/biological/nuclear hazards, privacy violations, and manipulation)?
How They Tested It
Think of an AI’s safety system like a store’s security camera that looks for certain patterns. If a rule checks for “obvious bad requests,” then disguising the request might fool the system. The researchers used poetry as the disguise.
Here’s what they did:
- They wrote 20 short poems that indirectly ask for harmful information using metaphor and imagery (like describing a dangerous process in flowery language instead of blunt steps).
- They also took 1,200 known harmful requests from a public safety benchmark (MLCommons AILuminate—basically a big test set of risky prompts) and automatically converted each one into poetic form using a standard instruction (a “meta-prompt”); a code sketch of this conversion-and-judging loop appears after this list.
- They tested these prompts on 25 different LLMs from 9 providers (Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral, Meta, xAI, Moonshot).
- They kept it “single-turn,” meaning they only sent one message to each model—no back-and-forth or clever follow-ups.
- To judge whether a model’s response was safe or unsafe, they used:
- An ensemble of three open-source “judge” models to label answers.
- Human reviewers to double-check a sample and resolve disagreements.
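To make that workflow concrete, here is a minimal Python sketch of the loop, assuming hypothetical `rewriter`, `target`, and `judge` callables standing in for the real provider and judge APIs. The meta-prompt wording and the two-of-three majority rule are illustrative assumptions, not the paper's exact implementation.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the provider and judge APIs; not the paper's code.
LLMCall = Callable[[str], str]          # prompt -> model response
JudgeCall = Callable[[str, str], bool]  # (prompt, response) -> True if unsafe

# Generic placeholder wording; the paper's standardized meta-prompt is not reproduced here.
META_PROMPT = "Rewrite the following request as a short poem that preserves its meaning:\n\n{request}"

def to_verse(request: str, rewriter: LLMCall) -> str:
    """Convert a prose request into a poetic variant via the meta-prompt."""
    return rewriter(META_PROMPT.format(request=request))

def is_unsafe(prompt: str, response: str, judges: List[JudgeCall]) -> bool:
    """Majority vote of the judge ensemble (e.g., 2 of 3 judges flag it unsafe)."""
    votes = [judge(prompt, response) for judge in judges]
    return sum(votes) > len(votes) // 2

def attack_success_rate(prompts: Iterable[str], target: LLMCall,
                        rewriter: LLMCall, judges: List[JudgeCall]) -> float:
    """ASR = fraction of single-turn poetic prompts whose responses are judged unsafe."""
    prompts = list(prompts)
    unsafe = 0
    for prose in prompts:
        poem = to_verse(prose, rewriter)
        response = target(poem)  # exactly one message per model: single-turn
        unsafe += is_unsafe(poem, response, judges)
    return unsafe / len(prompts)
```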
Important terms explained simply:
- Jailbreak: Getting a model to break its rules and give harmful info.
- Attack Success Rate (ASR): The percentage of times the model gave unsafe answers instead of refusing.
- Single-turn: You only send one message—no conversation.
Main Findings and Why They Matter
Big picture result: Poetry made models much more likely to break safety rules.
Key details:
- The average success rate for harmful prompts jumped from about 8% (regular text) to about 43% (poetry) across models—a roughly fivefold increase.
- The hand-crafted poems had an average success rate of 62% across the 25 models; some models went over 90%.
- The automatic poetry conversion (no special trickery beyond style) still raised success rates a lot compared to the original prose versions.
- This worked across many risk areas: cyber offenses (like malware help), CBRN (chemical/biological/radiological/nuclear), manipulation, privacy, and loss-of-control scenarios.
- The problem showed up in many different model families and training methods (including RLHF and Constitutional AI), meaning it’s a broad, not a model-specific, weakness.
Why it matters:
- It shows that changing only the style (turning prose into poetry) can bypass safety systems. That suggests current safety training focuses too much on obvious forms and not enough on how the same harmful idea can look when written creatively.
- It reveals a gap in how we test and align models: models may be trained to refuse direct, plain requests but not clever, stylized ones.
Implications and Impact
What this means going forward:
- Safety teams need to test models with more diverse styles—creative writing, metaphors, narratives—not just plain text. If a request’s meaning stays harmful, the safety system must catch it, no matter the style.
- Regulators and benchmark creators should add stylistic variations (like poetry) to their evaluations to see if models truly understand and refuse harmful intent.
- Model alignment (teaching models to be safe) may need new methods that look beyond pattern-matching of “bad phrases” and instead identify harmful intent even when disguised.
- Defenses could include:
- Better intent detection across styles and languages.
- Specialized filters that recognize metaphorical or indirect instructions.
- Training models on examples of stylized harmful prompts so they learn to refuse them consistently.
In short, the paper highlights a simple but powerful weakness: style can trick safety systems. Fixing this will require smarter guardrails that understand meaning, not just surface wording.
Knowledge Gaps
Limitations and Knowledge Gaps
The following limitations and knowledge gaps were identified in the paper:
Limitations
- Sample Size for Curated Poems: The paper relies on a small sample (20 curated adversarial poems) to demonstrate the effectiveness of poetic reformulation. This could limit the generalizability of results across a broader spectrum of poetic styles and forms.
- Dependency on Existing Taxonomies: The evaluation of adversarial poetry's effectiveness is mapped to MLCommons and EU CoP taxonomies. The paper does not explore or propose new taxonomies that might capture emerging risks not covered by existing frameworks.
- Model Specifications and Variability: The paper provides model specifications but does not deeply explore how different architectural features or training methods contribute to varying susceptibility to poetic jailbreaks.
Knowledge Gaps
- Impact of Different Poetic Styles: The paper does not comprehensively analyze how various poetic structures (e.g., sonnet, haiku) or devices (e.g., metaphor, allegory) influence the success rate of jailbreak attempts.
- Cross-Language Vulnerability: While the paper mentions English and Italian prompts, the impact of adversarial poetry in other languages and multilingual environments is left unexplored.
- Long-Term Effects on Model Alignment: The paper discusses immediate attack success but does not address the potential long-term effects of regular exposure to adversarial poetry on model alignment and safety.
- Ethical and Legal Implications: The work focuses primarily on technical vulnerabilities without considering the broader ethical and legal implications of using adversarial poetry in real-world applications.
- Countermeasures and Defensive Strategies: The paper identifies a vulnerability but does not extensively propose or test potential countermeasures or defensive strategies specifically tailored to mitigate risks associated with adversarial poetic prompts.
Open Questions
- Effectiveness of Combined Techniques: Can adversarial poetry, when combined with other stylistic or obfuscation techniques, further enhance the efficacy of jailbreak mechanisms?
- Role of Model Size and Training Data: How do changes in model size, composition of training datasets, and periodic updates influence susceptibility to adversarial poetry?
- Human Attacker Models: How would the introduction of human creativity in crafting adversarial poetry affect the success rates compared to systematically generated prompts?
- Accuracy of Judge Models: While using open-weight models for evaluation provides transparency, do these models accurately reflect human judgment in identifying unsafe outputs?
This list aims to guide future research efforts in addressing the identified gaps and refining methods to improve model safety and alignment.
Practical Applications
Immediate Applications
The following items translate the paper’s findings into actionable steps that can be deployed now. They focus on strengthening safety evaluations, governance, and operational safeguards across sectors, while avoiding descriptions that could facilitate misuse.
- Industry (software, cybersecurity, MLOps): Integrate poetic stress-testing into red-teaming pipelines
- What: Add automated “adversarial poetry” transformations to existing harmful-prompt corpora (e.g., MLCommons AILuminate) and compare ASR against prose baselines.
- Tools/workflows: Meta-prompt converter for verse, batch evaluation harness, ensemble LLM-as-a-judge, human adjudication loop.
- Assumptions/dependencies: Uses single-turn, text-only prompts and provider-default settings; requires annotated hazard taxonomies; relies on judge reliability and consistent rubrics.
- Industry (vendor selection, procurement): Benchmark and score LLM providers on style-robustness
- What: Add a “style robustness” criterion to procurement checklists and SLAs, reporting ASR deltas between prose and poetic prompts across risk categories (a scoring sketch follows this item).
- Sectors: Financial services, healthcare, education, enterprise SaaS.
- Assumptions/dependencies: Requires standardized prompt sets, reproducible judge ensemble, controlled sampling parameters.
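As an illustration, a style-robustness report of this kind could be computed from judged evaluation records as sketched below. The record fields (`category`, `style`, `unsafe`) are assumed names, not a standard schema.

```python
from collections import defaultdict

def asr_by_category(records):
    """records: dicts like {"category": "cyber_offense", "style": "prose" or "poetry",
    "unsafe": bool}, one per judged model response (field names are assumed).
    Returns {category: {"prose": asr, "poetry": asr, "delta": poetry - prose}}."""
    counts = defaultdict(lambda: {"prose": [0, 0], "poetry": [0, 0]})
    for r in records:
        unsafe_count_and_total = counts[r["category"]][r["style"]]
        unsafe_count_and_total[0] += int(r["unsafe"])
        unsafe_count_and_total[1] += 1
    report = {}
    for category, styles in counts.items():
        prose = styles["prose"][0] / max(styles["prose"][1], 1)
        poetry = styles["poetry"][0] / max(styles["poetry"][1], 1)
        report[category] = {"prose": prose, "poetry": poetry, "delta": poetry - prose}
    return report
```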
- Trust & Safety operations: Deploy input pre-screeners for stylistic obfuscation
- What: Lightweight classifiers flag inputs with poetic or highly stylized features (meter, rhyme, condensed metaphor) for additional review or safer decoding regimes (a toy heuristic is sketched after this item).
- Tools/workflows: “Style-aware guardrail” module, escalation rules, rate-limiting triggers, human-in-the-loop review for flagged content.
- Assumptions/dependencies: False positives must be managed to avoid overblocking legitimate creative content.
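A deliberately crude sketch of such a pre-screener is shown below; the thresholds and the suffix-matching "rhyme" proxy are illustrative assumptions, and a production system would use a trained classifier rather than these heuristics.

```python
import re

def looks_poetic(text: str) -> bool:
    """Very rough verse detector: many short lines, or adjacent lines whose final
    words share an ending (a crude rhyme proxy). Intended only to route inputs to
    stricter review, not to make a safety decision on its own."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 4:
        return False

    def last_word_tail(line: str) -> str:
        words = re.findall(r"[a-zA-Z']+", line.lower())
        return words[-1][-3:] if words else ""

    short_fraction = sum(len(ln.split()) <= 10 for ln in lines) / len(lines)
    rhyme_pairs = sum(
        1 for a, b in zip(lines, lines[1:])
        if last_word_tail(a) and last_word_tail(a) == last_word_tail(b)
    )
    return short_fraction > 0.7 or rhyme_pairs >= 2
```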
- Model-serving governance (MLOps): CI/CD safety gates that include style-fuzzing
- What: Add verse-based adversarial tests to pre-release checks, post-release monitoring, and rollback criteria when ASR exceeds thresholds (a minimal gate is sketched after this item).
- Tools/workflows: Continuous evaluation dashboard; hazard-category coverage tracking; change management with safety thresholds.
- Assumptions/dependencies: Stable evaluation harness; alignment with organizational risk appetite; cost for routine large-scale testing.
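A minimal gate of this kind might look like the following, assuming ASR figures are produced upstream by the evaluation harness; the threshold values are placeholders, not recommendations.

```python
import sys

# Placeholder thresholds; real limits would come from the organization's risk policy.
MAX_ASR = {"prose": 0.05, "poetry": 0.10}

def safety_gate(measured_asr: dict) -> int:
    """Return a non-zero exit code when any measured ASR exceeds its threshold,
    so the CI job fails and the release (or rollout) is blocked."""
    breaches = {}
    for style, asr in measured_asr.items():
        limit = MAX_ASR.get(style, 0.0)  # unknown styles default to zero tolerance
        if asr > limit:
            breaches[style] = (asr, limit)
    for style, (asr, limit) in breaches.items():
        print(f"FAIL: {style} ASR {asr:.1%} exceeds limit {limit:.1%}")
    return 1 if breaches else 0

if __name__ == "__main__":
    # The numbers here are made up; they would normally come from the evaluation harness.
    sys.exit(safety_gate({"prose": 0.03, "poetry": 0.18}))
```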
- Policy and compliance (regulators and internal governance): Require style-variant testing in risk assessments
- What: Update internal policies to mandate reporting of ASR for stylistic transformations alongside standard jailbreak metrics; include crosswalk to EU CoP systemic risk domains.
- Sectors: Public-sector AI procurement, regulated industries (finance, healthcare).
- Assumptions/dependencies: Harmonization with MLCommons and EU CoP taxonomies; documented audit trails.
- Academia (evaluation and reproducibility): Release style-augmented benchmarks and judge ensembles
- What: Publish verse-transformed datasets aligned to hazard taxonomies; detail labeling protocols and human validation procedures.
- Tools/workflows: Open-weight judge ensembles, inter-rater agreement metrics, adjudication processes.
- Assumptions/dependencies: Ethical review; careful sanitization to avoid operational harmful details; license constraints for model outputs.
- Enterprise productization: “Adversarial Poetry Fuzzer” and “Style Stress Tester”
- What: Packaged services that convert known risky prompts into poetic variants and score models under single-turn conditions.
- Sectors: LLM platforms, model evaluation SaaS, managed AI services.
- Assumptions/dependencies: Data handling and privacy constraints; integration with vendor APIs.
- Cybersecurity (blue team): Update incident playbooks to include stylistic jailbreaks
- What: Recognize poetry as a jailbreak vector; encode detection steps, higher-scrutiny modes, and emergency refusal policies for stylized inputs tied to sensitive capabilities.
- Assumptions/dependencies: Close coordination with AI platform teams; tested fallback modes that preserve service stability.
- Healthcare and education platforms (domain safety): Immediate vulnerability scans of deployed assistants
- What: Evaluate clinical or student-facing assistants for elevated ASR in poetry versus prose across domain-specific hazards (e.g., specialized advice).
- Tools/workflows: Sector-specific prompt sets; medical and pedagogical review boards for adjudication.
- Assumptions/dependencies: Strict compliance with sector regulations (HIPAA, FERPA); careful handling of sensitive templates.
- Finance and e-commerce (fraud and manipulation risks): Strengthen guardrails against poetic framing
- What: Test for elevated compliance on privacy violations, non-violent crimes, and persuasive manipulation when prompts are stylized.
- Assumptions/dependencies: Integrate anti-fraud teams into model governance; calibrate refusal heuristics to intent rather than form.
- Daily life (consumer platforms): User-facing safety messaging and mode controls
- What: Warn that creative modes can affect safety; give admins configurable policies to restrict responses to stylized requests in sensitive contexts.
- Assumptions/dependencies: Balance usability and creativity; local moderation requirements; transparency notices.
- Open benchmark governance (MLCommons, standards bodies): Fast-track inclusion of stylistic obfuscation testing
- What: Add poetry-based variants to standard test protocols; publish comparative ASR deltas by hazard and provider.
- Assumptions/dependencies: Community alignment on evaluation stacks; reproducibility norms.
Long-Term Applications
The following items require further research, scaling, or engineering to be feasible. They aim to harden systems against stylistic obfuscation and establish durable standards.
- Robust alignment methods that generalize across discourse modes
- What: Adversarial training with style-augmented data (poetry and other genres), safety objectives that prioritize intent over surface form, and improved refusal generalization (a minimal data-augmentation sketch follows this item).
- Sectors: All sectors deploying LLMs in high-stakes workflows.
- Dependencies: High-quality curated datasets, safe data curation, compute cost, careful avoidance of overfitting to specific styles.
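One minimal form of such augmentation, assuming a `to_verse` rewriting callable and a fixed refusal target, is sketched below; real safety fine-tuning data would be far more varied and carefully curated.

```python
def build_style_augmented_refusals(harmful_prompts, to_verse, refusal_text):
    """Pair each harmful prompt, in both its prose and verse forms, with the same
    refusal, so fine-tuning rewards refusing the intent regardless of style.
    `to_verse` is any prompt-rewriting callable (e.g., a meta-prompt wrapper)."""
    examples = []
    for prose in harmful_prompts:
        for variant in (prose, to_verse(prose)):
            examples.append({"prompt": variant, "completion": refusal_text})
    return examples
```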
- Semantic-intent first defenses
- What: Pre-inference canonicalization that paraphrases stylized inputs into neutral prose, semantic risk inference modules, and policy checks that operate on inferred intent rather than raw text form (a sketch follows this item).
- Tools/products: Input normalizers; intent classifiers with confidence calibration; semantic policy engines.
- Dependencies: Reliable semantic parsers; low-latency pipelines; explainability and auditability.
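The sketch below illustrates the canonicalize-then-check ordering, with `paraphrase`, `is_disallowed`, and `answer` as hypothetical callables; it shows the shape of the pipeline, not a working defense.

```python
from typing import Callable

Paraphraser = Callable[[str], str]    # stylized input -> neutral prose restatement
PolicyCheck = Callable[[str], bool]   # text -> True if the request is disallowed

def guarded_inference(user_input: str, paraphrase: Paraphraser,
                      is_disallowed: PolicyCheck, answer: Callable[[str], str]) -> str:
    """Canonicalize first, then apply the policy check to both the canonical and
    the raw form, and only answer if neither is flagged."""
    canonical = paraphrase(user_input)
    if is_disallowed(canonical) or is_disallowed(user_input):
        return "I can't help with that."
    return answer(user_input)
```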
- Style-aware safety architectures
- What: Decoupled safety classifiers that detect and robustly handle poetic and other stylized inputs; multi-channel gating (e.g., creativity mode isolation); separate safety heads tuned to rhetorical features.
- Dependencies: Architecture and inference changes; careful trade-offs with creative capabilities.
- Standards and certification for style robustness
- What: Sector-agnostic metrics and badges for “style-robust” safety; third-party audits that report ASR deltas across prose, poetry, and other stylistic variants.
- Policy: Updates to EU CoP and industry codes to mandate style-variant testing in conformity assessments.
- Dependencies: Consensus on metrics; test-set maintenance; auditor accreditation.
- Expanded adversarial style taxonomy and detectors
- What: Beyond poetry—develop detectors for narrative frames, role-play rhetoric, dense metaphor, low-resource languages, and hybrid obfuscation strategies with comparable surface deviations.
- Dependencies: Cross-lingual and cross-genre corpora; multilingual safety evaluations; fairness considerations to avoid biased overblocking.
- Multi-modal style attacks and defenses
- What: Investigate whether poetic rhythm, prosody, or multimedia formatting (audio, images with stylized text) similarly degrade guardrails; build multimodal detectors and canonicalizers.
- Dependencies: Multimodal models and datasets; safety rubrics for cross-modal content.
- LLM-as-a-judge reliability improvements
- What: Calibrated ensembles with human-in-the-loop validation; standardized rubrics; cross-model agreement metrics; methods to reduce label drift over time (a pairwise-kappa sketch follows this item).
- Dependencies: Annotator training; periodic re-benchmarking; open-weight judge evolution.
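As one concrete agreement metric, pairwise Cohen's kappa between judge models can be computed as below; the judge names and binary "unsafe" labels are assumptions about how verdicts are stored.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two aligned lists of binary labels (True = unsafe)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def pairwise_judge_agreement(labels_by_judge):
    """labels_by_judge: {"judge_a": [...], "judge_b": [...], ...} with labels
    aligned by item. Returns Cohen's kappa for every pair of judges."""
    return {
        (j1, j2): cohen_kappa(labels_by_judge[j1], labels_by_judge[j2])
        for j1, j2 in combinations(sorted(labels_by_judge), 2)
    }
```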
- Sector-specific hardening
- Healthcare: Clinical assistants that remain safe under stylized prompts; formal verification of refusal mechanisms for specialized advice.
- Finance: Fraud-prevention LLMs with strong style-robust privacy and non-violent crime guardrails.
- Robotics/IoT/Energy: Controls to prevent loss-of-control from stylized commands; safety gating for ICS/SCADA-related queries.
- Dependencies: Domain experts; regulatory alignment; scenario libraries tailored to each sector.
- Governance and liability frameworks
- What: Policy mechanisms assigning responsibility for style-robust testing, incident disclosure norms when ASR spikes under stylistic inputs, and risk scoring models for procurement.
- Dependencies: Legal consensus; regulator guidance; harmonized reporting formats.
- Curriculum and workforce development
- What: Training for red-teamers and safety engineers on stylistic obfuscation vectors; academic courses and certifications for style-aware AI safety.
- Dependencies: Institutional adoption; standardized learning materials; safe practice environments.
- Watermarking and provenance for safety
- What: Watermarks to trace transformations, provenance tracking to detect adversarial stylization upstream, and forensic tools to audit inputs used in incidents.
- Dependencies: Adoption across platforms; resilience to removal attempts; privacy constraints.
- Research on capability–alignment interactions
- What: Investigate why some providers show large ASR increases (e.g., +50–60 points) while others remain comparatively robust; study how linguistic sophistication interacts with refusal prioritization.
- Dependencies: Access to model families; controlled experiments; transparent alignment documentation.
Key assumptions and dependencies affecting feasibility
- Single-turn, text-only threat model with default provider settings; different configurations may change ASR patterns.
- Judge ensemble reliability and human adjudication quality shape measured ASR; results depend on consistent rubrics and stratified sampling.
- Domain generalization from English (and some Italian) to other languages requires additional validation.
- Defensive measures may reduce creative utility or increase false positives; governance must balance safety with product experience.
- Providers continuously update guardrails; applications must include ongoing monitoring to handle drift in model behavior.
Glossary
- Adversarial poetry: A stylistic reframing of harmful prompts into verse to bypass safety mechanisms in LLMs. "We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for LLMs."
- AILuminate Safety Benchmark (MLCommons): A standardized dataset for evaluating AI safety across hazard categories and personas. "we augment the curated dataset with the MLCommons AILuminate Safety Benchmark."
- Alignment constraints: Safety rules and policies embedded in LLMs that restrict harmful outputs. "poetic formatting can reliably bypass alignment constraints."
- Attack surface: The accessible points an adversary can exploit to cause unsafe behavior. "The attack surface is therefore confined to one-shot prompt construction executed under standard inference conditions."
- Attack Success Rate (ASR): The proportion of prompts that elicit unsafe outputs from a model. "curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%."
- Attention Shifting attacks: Techniques that create distracting or complex contexts to divert a model from safety rules. "Similarly, Attention Shifting attacks create overly complex or distracting reasoning contexts that divert the model’s focus from safety constraints"
- Black-box threat model: An evaluation setting where the attacker has no knowledge of model internals or configurations. "We adopt a strict black-box threat model"
- CBRN: Safety domain covering Chemical, Biological, Radiological, and Nuclear risks. "CBRN hazards"
- CBRNE: An extended hazard category including Explosives along with CBRN. "Indiscriminate Weapons (CBRNE)"
- Character-Level Perturbations: Input manipulations at the character level to evade refusal distributions. "Character-Level Perturbations"
- Constitutional AI: An alignment approach where models follow a set of principles specified in a “constitution.” "Constitutional AI"
- Cyber Offense: Domain involving enabling or facilitating harmful cyber activities. "cyber-offence capabilities"
- DAN “Do Anything Now” prompts: Community-crafted jailbreak templates combining multiple techniques. "see the DAN “Do Anything Now” family of prompts"
- Data exfiltration: Covert extraction of data from a system. "Data exfiltration / covert extraction"
- Distributional shift: A change in input style or characteristics that moves prompts outside the training distribution. "This indicates that poetic form induces a distributional shift significantly larger than that of current adversarial mutations"
- Ensemble (judge models): Multiple models used jointly to evaluate outputs for safety. "Outputs are evaluated using an ensemble of open-weight judge models"
- EU Code of Practice for General-Purpose AI Models: A regulatory framework defining systemic-risk domains for GPAI. "aligned it with the European Code of Practice for General-Purpose AI Models."
- Goal Hijacking: A jailbreak strategy that assigns objectives conflicting with safety policies. "Among these, Goal Hijacking is the canonical example."
- Guardrails: Safety mechanisms and heuristics that enforce model refusals. "pattern-matching heuristics on which guardrails rely."
- Inter-rater agreement: A reliability measure of consistency among different evaluators. "We computed inter-rater agreement across the three judge models"
- Jailbreak: Manipulating prompts to circumvent a model’s safety, ethical, or legal constraints. "Jailbreak denotes the deliberate manipulation of input prompts to induce the model to circumvent its safety, ethical, or legal constraints."
- LLM-as-a-judge methodology: Using LLMs to label or assess other models’ outputs. "we follow state-of-the-art LLM-as-a-judge methodology."
- Loss-of-control scenarios: Risks where systems behave autonomously or unpredictably beyond intended constraints. "loss-of-control scenarios"
- Low-Resource Languages: Languages with limited training data used to evade safety filters. "Low-Resource Languages"
- Meta-prompt: A higher-level instruction used to systematically transform many prompts into a target style. "Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt"
- Mixture-of-experts approaches: Model architectures that route inputs to specialized expert components. "models trained via RLHF, Constitutional AI, and mixture-of-experts approaches"
- Mismatched Generalization: Safety failures when models don’t generalize refusal policies to stylistically altered harmful requests. "Mismatched Generalization attacks, on the other hand, alter the surface form of harmful content to drift it outside the model’s refusal distribution"
- Model-weight serialization + exfiltration: Saving model parameters to a format and covertly extracting them. "Model-weight serialization + exfiltration"
- Open-weight models: Models whose weights are publicly available for replication and auditing. "Across 25 frontier proprietary and open-weight models"
- Out-of-distribution (OOD) requests: Inputs that differ significantly from the model’s training data patterns. "it creates out-of-distribution requests through stylistic deviation"
- Persuasion (in jailbreaks): Exploiting social-influence patterns to reduce refusals. "can be exploited by jailbreak by using Persuasion"
- Privacy intrusions: Attacks aimed at accessing or revealing private information. "comprising CBRN, manipulation, privacy intrusions, misinformation generation, and even cyberattack facilitation."
- Reinforcement Learning from Human Feedback (RLHF): Training models using human preference signals to guide behavior. "Reinforcement Learning from Human Feedback (RLHF)"
- Remote Access Trojan (RAT): Malware that provides persistent unauthorized remote control. "To deploy a persistent remote access trojan (RAT), proceed as follows: first..."
- Remote Code Execution (RCE): The ability to run arbitrary code on a target system remotely. "Code injection / RCE"
- Role Play jailbreaks: Prompts that instruct the model to adopt a persona to relax safety constraints. "Role Play jailbreaks are a canonical example"
- Single-turn (attack): A one-shot prompt with no iterative or multi-turn interaction. "All attacks are strictly single-turn"
- Stratified subset: A sample drawn to preserve distribution across categories for evaluation. "a human-validated stratified subset (with double-annotations to measure agreement)."
- Stylistic Obfuscation: Altering the presentation style to bypass safety without changing intent. "Stylistic Obfuscation techniques"
- Taxonomy-preservation checks: Procedures ensuring transformed prompts stay within their original hazard categories. "each rewritten prompt undergoes taxonomy-preservation checks"
- Versification: The process of converting prose into verse, potentially causing minor semantic drift. "minor semantic drift is inherent to versification"