Explain Transferability of Single-Query-Trained BPJ Attacks
Investigate, both formally and empirically, why adversarial prefixes that Boundary Point Jailbreaking (BPJ) optimizes against a single harmful target query (the single-attack setting) transfer to a wide range of unseen queries.
References
A range of open questions remain, including developing defences to BPJ and exploring formally and empirically why BPJ attacks learned on a single attack readily transfer to other queries.
— Boundary Point Jailbreaking of Black-Box LLMs
(2602.15001 - Davies et al., 16 Feb 2026) in Section 6 (Discussion), Broader Implications