What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
This presentation examines a rigorous mechanistic interpretability analysis of how representation steering controls refusal behavior in large language models. Through novel multi-token activation patching and circuit discovery methods, the research reveals that steering vectors—regardless of how they're trained—operate through a shared, highly localized subcircuit centered on attention value projections. The talk explores how only 10-11% of model edges are needed to recover 85% of the steering effect, and how a sparse subset of vector dimensions (as few as 1-10%) can retain nearly all steering performance, with direct implications for robust AI alignment and defense against jailbreaking.

Script
Steering vectors let us control how language models behave after training, pushing them toward or away from specific behaviors like refusing harmful requests. But here's the puzzle: nobody really understood the mechanism. How do these vectors actually rewire the model's computations to change what it says?
The authors built a multi-token activation patching framework to trace exactly which computational edges in the model carry the steering effect forward. They compared three different ways of creating steering vectors: a simple difference-in-means approach, next-token prediction training, and preference optimization. Despite being trained completely differently, all three methods converged on the same surprising answer.
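To make the simplest of those three methods concrete, here is a minimal sketch of a difference-in-means steering vector: the gap between mean residual-stream activations on refusal-inducing versus benign prompts. The array shapes and the toy random activations are illustrative assumptions, not the paper's actual data or implementation.

```python
import numpy as np

def diff_in_means_vector(harmful_acts, harmless_acts):
    """Difference-in-means steering vector: mean activation on
    refusal-inducing prompts minus mean activation on benign ones."""
    return harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)

# Toy stand-in activations, shape (n_prompts, d_model).
rng = np.random.default_rng(0)
harmful = rng.normal(loc=1.0, size=(32, 8))   # hypothetical "harmful" batch
harmless = rng.normal(loc=0.0, size=(32, 8))  # hypothetical "benign" batch

v = diff_in_means_vector(harmful, harmless)   # one vector of size d_model
```

At inference time, such a vector is typically added to (or subtracted from) the residual stream at a chosen layer to push the model toward or away from refusal.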
The steering effect flows through a tiny, shared subcircuit.
Here's what shocked them: freezing all attention query-key computations barely affected steering, dropping performance by just 8.75%. But ablating the value projection pathway collapsed it by more than 44%. Steering doesn't change where the model looks. It changes what the model retrieves and propagates forward through those attention values.
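The freeze-versus-ablate contrast can be illustrated with a toy single-head attention layer. In this sketch (my own simplified stand-in, not the paper's patching framework), `freeze_qk_from` recomputes the attention pattern from the unsteered input, so only the value pathway sees the steering vector; comparing that output to full steering isolates where the effect flows.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv, steer=None, freeze_qk_from=None):
    """Toy single-head attention (no output projection).
    `steer`: vector added to the residual stream before projections.
    `freeze_qk_from`: if given, query-key scores are computed from this
    (unsteered) input, so steering reaches only the value pathway."""
    x_in = x if steer is None else x + steer
    qk_src = freeze_qk_from if freeze_qk_from is not None else x_in
    scores = (qk_src @ Wq) @ (qk_src @ Wk).T / np.sqrt(Wq.shape[1])
    return softmax(scores) @ (x_in @ Wv)

rng = np.random.default_rng(0)
seq, d = 4, 8
x = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
v = rng.normal(size=d)  # hypothetical steering vector

base = attention(x, Wq, Wk, Wv)                     # clean run
steered = attention(x, Wq, Wk, Wv, steer=v)         # full steering
values_only = attention(x, Wq, Wk, Wv, steer=v,
                        freeze_qk_from=x)           # QK frozen at clean run
```

The paper's finding, in these terms, is that `values_only` recovers almost all of the behavioral effect of `steered` in real models, while zeroing the steered component out of the value path destroys it.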
The authors decomposed steering vectors into dimension-specific contributions and projected them through a logit lens. Certain dimensions aligned strongly with refusal-related concepts across all methods. Others were redundant or method-specific, likely due to superposition. But the core insight held: a tiny subspace encodes the refusal concept.
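A minimal sketch of that per-dimension logit-lens decomposition, using a random stand-in for the unembedding matrix and steering vector (both hypothetical here): each dimension's contribution to every token logit is examined, dimensions are ranked by the strength of their largest logit push, and only a sparse subset is kept.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))  # stand-in for the unembedding matrix
v = rng.normal(size=d_model)             # stand-in for a trained steering vector

# Logit-lens view of each dimension: what dimension i of v alone
# contributes to every token's logit.
contribs = v[:, None] * W_U              # shape (d_model, vocab)

# Rank dimensions by their largest logit push, then keep a sparse
# subset (here ~10% of dimensions, echoing the paper's 1-10% finding).
strength = np.abs(contribs).max(axis=1)
keep = np.argsort(strength)[::-1][: max(1, d_model // 10)]
v_sparse = np.zeros_like(v)
v_sparse[keep] = v[keep]
```

Summing `contribs` over dimensions recovers the full logit-lens projection `v @ W_U`, which is what makes the per-dimension attribution well defined.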
This has real consequences. If refusal steering depends on such a narrow, interpretable circuit, defenders can monitor it, and attackers can target it. The very precision that makes steering efficient also makes it a single point of failure. Understanding the mechanism is the first step toward making alignment strategies both more robust and more auditable.
Steering vectors don't scatter their influence across the whole model. They carve a narrow, predictable path through a handful of attention values and dimensions. That's both the power and the vulnerability of representation-based control. To explore this paper further or create your own research videos, visit EmergentMind.com.