Generalize ASTRA’s attention loss using distance metrics
Investigate the effectiveness of generalized attention-based loss functions for the ASTRA (Adversarial Subversion through Targeted Redirection of Attention) attack. Instead of maximizing a simple sum of attention mass on payload tokens, such losses would minimize a distance (for example, Kullback–Leibler divergence or Wasserstein distance) between an ideal attention distribution and the model's actual attention distribution. Determine whether these generalized losses improve prompt injection attack performance against fine-tuning-based defenses such as SecAlign and StruQ.
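A minimal sketch of what such generalized losses might look like, in PyTorch. The tensor names (`attn` for one layer's attention weights, `payload_mask` for the payload's key positions), the choice of a uniform-over-payload ideal distribution, the KL direction (ideal ‖ actual), and the 1-D ground metric for Wasserstein are all assumptions made here for illustration; the paper only proposes replacing the payload-attention sum with a distance between ideal and real attention distributions.

```python
import torch

def astra_sum_loss(attn, payload_mask):
    """Baseline ASTRA-style objective: negated attention mass on payload
    tokens, so minimizing the loss maximizes payload attention.
    attn: (heads, query_len, key_len) attention weights for one layer.
    payload_mask: (key_len,) boolean mask over key positions."""
    return -attn[..., payload_mask].sum(dim=-1).mean()

def astra_kl_loss(attn, payload_mask, eps=1e-8):
    """Generalized objective: KL(ideal || actual), where the ideal
    distribution puts (near-)uniform mass on payload key positions."""
    key_len = attn.shape[-1]
    ideal = torch.full((key_len,), eps)
    ideal[payload_mask] = 1.0
    ideal = ideal / ideal.sum()  # normalize into a probability distribution
    log_ratio = ideal.log() - attn.clamp_min(eps).log()
    # Sum over key positions; average over heads and query positions.
    return (ideal * log_ratio).sum(dim=-1).mean()

def astra_wasserstein_loss(attn, payload_mask):
    """Generalized objective: 1-Wasserstein distance between the ideal and
    actual attention distributions, treating key positions as points on a
    1-D line (closed form: L1 distance between the two CDFs)."""
    key_len = attn.shape[-1]
    ideal = torch.zeros(key_len)
    ideal[payload_mask] = 1.0 / payload_mask.sum()
    cdf_gap = torch.cumsum(attn, dim=-1) - torch.cumsum(ideal, dim=-1)
    return cdf_gap.abs().sum(dim=-1).mean()
```

Either generalized loss would slot into the attack's optimization loop in place of the negated attention sum, with gradients flowing back to the adversarial tokens; which KL direction, ideal distribution, or ground metric works best against SecAlign and StruQ is precisely the open question posed above.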
References
"We leave the exploration of the utility of more sophisticated distance functions to future work."
— Pandya et al., "May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks" (arXiv:2507.07417, 10 Jul 2025), Section 7.1, Discussion: A more general framework