Explain Transferability of Single-Query-Trained BPJ Attacks

Investigate, both formally and empirically, why adversarial prefixes that Boundary Point Jailbreaking (BPJ) optimizes against a single harmful target query (the single-attack setting) transfer to a wide range of unseen queries.

Background

The paper observes that BPJ prefixes optimized on a single query progressively improve transfer to unseen queries, yielding universal jailbreaks across diverse tasks and datasets, including HarmBench and long-form biological misuse questions. This emergent transferability appears robust across different systems and curricula.

While the paper's theoretical section provides a stylized evolutionary framework and a continuation analysis, it does not explain why single-query optimization leads to broad transfer. Understanding the mechanism behind this generalization, through formal modeling and empirical study, remains an open research direction.

References

A range of open questions remain, including developing defences to BPJ and exploring formally and empirically why BPJ attacks learned on a single attack readily transfer to other queries.

Boundary Point Jailbreaking of Black-Box LLMs (2602.15001 - Davies et al., 16 Feb 2026) in Section 6 (Discussion), Broader Implications