Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash 91 tok/s
Gemini 2.5 Pro 53 tok/s Pro
GPT-5 Medium 29 tok/s
GPT-5 High 26 tok/s Pro
GPT-4o 98 tok/s
GPT OSS 120B 470 tok/s Pro
Kimi K2 216 tok/s Pro
2000 character limit reached

HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models (2502.05945v2)

Published 9 Feb 2025 in cs.CL and cs.AI

Abstract: Robust alignment guardrails for LLMs are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model subcomponents, particularly attention heads, using a simple binary choice probing strategy. These interventions then generalise to the open-ended generation setting effectively circumventing safety guardrails. We show that probing single attention heads is more effective than intervening on full layers and intervening on only four attention heads is comparable to supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. Our findings highlight the shortcomings of current alignment techniques. In addition, our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviors. Practically, the approach offers a straightforward methodology to steer LLM behaviour, which could be extended to diverse domains beyond safety requiring fine-grained control over the model output. The code and datasets for this study can be found on https://github.com/PaulDrm/targeted_intervention.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.