Cross-linguistic generality of adversarial poetry jailbreaks

Determine whether the elevated attack-success rates induced by poetic reformulation of harmful prompts generalize beyond English and Italian to other languages, scripts, and culturally distinct poetic forms. Ascertain how any such cross-linguistic generalization interacts with model pretraining corpora and safety-alignment distributions.

Background

The paper presents evidence that reformulating harmful requests as poetry functions as a universal single-turn jailbreak operator across 25 open-weight and proprietary LLMs. Hand-crafted adversarial poems achieved an average attack-success rate (ASR) of 62%, and a meta-prompt conversion of 1,200 MLCommons AILuminate harmful prompts into verse yielded ASRs substantially higher than prose baselines, with effects transferring across CBRN, cyber-offense, manipulation, privacy, and loss-of-control domains.

However, the evaluation was conducted only in English and Italian. The authors explicitly note that it is unknown whether the observed vulnerability extends to other languages, scripts, or culturally distinct poetic traditions. Establishing whether the effect generalizes across languages, and how that generalization depends on pretraining distributions and alignment methods, is critical for assessing the systemic risk surface and for designing defenses that remain robust across languages.
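
As a minimal sketch of how a cross-linguistic comparison could be scored, the snippet below computes ASR per (language, style) condition and the per-language difference between verse and prose. It assumes each model response has already been judged safe or unsafe by some external procedure. The record layout, the language codes, the style labels "prose" and "verse", and the function names are all hypothetical, and the sample judgements are placeholders rather than results from the paper.

```python
from collections import defaultdict


def asr_by_condition(records):
    """Return {(language, style): ASR}, where ASR = unsafe responses / total prompts."""
    counts = defaultdict(lambda: [0, 0])  # (language, style) -> [unsafe, total]
    for language, style, unsafe in records:
        cell = counts[(language, style)]
        cell[0] += int(unsafe)
        cell[1] += 1
    return {key: unsafe / total for key, (unsafe, total) in counts.items()}


def poetry_uplift(asr):
    """Return {language: ASR(verse) - ASR(prose)} for languages with both conditions."""
    languages = {lang for lang, _ in asr}
    return {
        lang: asr[(lang, "verse")] - asr[(lang, "prose")]
        for lang in sorted(languages)
        if (lang, "verse") in asr and (lang, "prose") in asr
    }


# Placeholder judgements for illustration only (hypothetical, not data from the paper).
# Each record is (language code, prompt style, response judged unsafe?).
records = [
    ("en", "prose", False), ("en", "prose", False),
    ("en", "verse", True), ("en", "verse", False),
    ("ja", "prose", False), ("ja", "prose", True),
    ("ja", "verse", True), ("ja", "verse", True),
]

asr = asr_by_condition(records)
uplift = poetry_uplift(asr)
```

Reporting the verse-minus-prose difference per language, rather than raw verse ASR alone, separates the effect of poetic framing from any baseline difference in refusal behavior between languages. That distinction matters here because safety alignment may already be weaker in some languages even for prose prompts.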

References

Sixth, the evaluation is limited to English and Italian prompts. The generality of the effect across other languages, scripts, or culturally distinct poetic forms is unknown and may interact with both pretraining corpora and alignment distributions.

— Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models (2511.15304 - Bisconti et al., 19 Nov 2025) in Section Analysis, Subsection Limitations
