Generalize ASTRA’s attention loss using distance metrics
Investigate the effectiveness of generalized attention-based loss functions for the ASTRA (Adversarial Subversion through Targeted Redirection of Attention) attack. Instead of maximizing a simple sum of attention mass on payload tokens, such losses would minimize a distance (for example, Kullback–Leibler divergence or Wasserstein distance) between an ideal attention distribution and the model's actual attention distribution. Determine whether these generalized losses improve prompt injection attack performance against fine-tuning-based defenses such as SecAlign and StruQ.
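A minimal sketch of what such generalized losses might look like, in PyTorch. The tensor names (`attn` for one layer's attention weights, `payload_mask` for the payload's key positions), the choice of a uniform-over-payload ideal distribution, the KL direction (ideal ‖ actual), and the 1-D ground metric for Wasserstein are all assumptions made here for illustration; the paper only proposes replacing the payload-attention sum with a distance between ideal and real attention distributions.

```python
import torch

def astra_sum_loss(attn, payload_mask):
    """Baseline ASTRA-style objective: negated attention mass on payload
    tokens, so minimizing the loss maximizes payload attention.
    attn: (heads, query_len, key_len) attention weights for one layer.
    payload_mask: (key_len,) boolean mask over key positions."""
    return -attn[..., payload_mask].sum(dim=-1).mean()

def astra_kl_loss(attn, payload_mask, eps=1e-8):
    """Generalized objective: KL(ideal || actual), where the ideal
    distribution puts (near-)uniform mass on payload key positions."""
    key_len = attn.shape[-1]
    ideal = torch.full((key_len,), eps)
    ideal[payload_mask] = 1.0
    ideal = ideal / ideal.sum()  # normalize into a probability distribution
    log_ratio = ideal.log() - attn.clamp_min(eps).log()
    # Sum over key positions; average over heads and query positions.
    return (ideal * log_ratio).sum(dim=-1).mean()

def astra_wasserstein_loss(attn, payload_mask):
    """Generalized objective: 1-Wasserstein distance between the ideal and
    actual attention distributions, treating key positions as points on a
    1-D line (closed form: L1 distance between the two CDFs)."""
    key_len = attn.shape[-1]
    ideal = torch.zeros(key_len)
    ideal[payload_mask] = 1.0 / payload_mask.sum()
    cdf_gap = torch.cumsum(attn, dim=-1) - torch.cumsum(ideal, dim=-1)
    return cdf_gap.abs().sum(dim=-1).mean()
```

Either generalized loss would slot into the attack's optimization loop in place of the negated attention sum, with gradients flowing back to the adversarial tokens; which KL direction, ideal distribution, or ground metric works best against SecAlign and StruQ is precisely the open question posed above.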
References
"We leave the exploration of the utility of more sophisticated distance functions to future work."
— Pandya et al., "May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks" (arXiv:2507.07417, 10 Jul 2025), Section 7.1, Discussion: A more general framework