Assess risks of representation hijacking beyond jailbreaks
Investigate whether in-context representation hijacking affects behaviors in large language models that are subtler than explicit jailbreaks, such as biasing reasoning chains, interfering with tool use, or manipulating decision-making in high-stakes domains (e.g., legal or medical applications).
References
First, beyond jailbreak scenarios, representation hijacking may pose risks in subtler domains, such as biasing reasoning chains, interfering with tool use, or manipulating decision-making in high-stakes contexts (e.g., legal or medical). Second, our work focuses on the attack surface and does not yet evaluate specific mitigation strategies. These open questions serve as stepping stones toward a new research frontier: representation-level alignment and defense.