Develop and evaluate mitigation strategies against representation-level hijacking

Develop and rigorously evaluate mitigation strategies that defend large language models against in-context representation hijacking of token semantics, moving beyond attack characterization to concrete protective mechanisms.

Background

The paper focuses on exposing and analyzing a representation-level attack but explicitly notes that mitigation strategies have not yet been evaluated. The authors frame this as an open question, pointing to the need for defenses that address evolving semantic representations during inference.

References

Second, our work focuses on the attack surface and does not yet evaluate specific mitigation strategies. These open questions serve as stepping stones toward a new research frontier: representation-level alignment and defense.

In-Context Representation Hijacking (2512.03771 - Yona et al., 3 Dec 2025) in Section 6, Discussion, limitations, and future work