
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Published 9 Apr 2026 in cs.LG, cs.AI, and cs.CL (arXiv:2604.08524v1)

Abstract: Applying steering vectors to LLMs is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal. We propose a multi-token activation patching framework and discover that different steering methodologies leverage functionally interchangeable circuits when applied at the same layer. These circuits reveal that steering vectors primarily interact with the attention mechanism through the OV circuit while largely ignoring the QK circuit: freezing all attention scores during steering drops performance by only 8.75% across two model families. A mathematical decomposition of the steered OV circuit further reveals semantically interpretable concepts, even in cases where the steering vector itself does not. Leveraging the activation patching results, we show that steering vectors can be sparsified by up to 90-99% while retaining most performance, and that different steering methodologies agree on a subset of important dimensions.

Summary

  • The paper presents a comprehensive mechanistic dissection of LLM refusal steering through multi-token activation patching and circuit discovery.
  • It reveals that steering effects predominantly propagate via the OV attention pathway, critically influencing model safety and jailbreaking defenses.
  • The study introduces sparsification techniques that isolate a sparse subset (90–99% zeros) of steering vector dimensions while retaining near-complete performance.

Mechanistic Analysis of Representation Steering for LLM Refusal Control

Introduction

Representation steering techniques, particularly activation addition with learned steering vectors, have become a standard tool for post-hoc alignment of LLMs. However, the mechanistic underpinnings of how steering vectors interact with model internals have remained largely uncharacterized. "What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal" (2604.08524) offers a rigorous mechanistic interpretability analysis focused on refusal steering—modulating LLM responses to either encourage or inhibit refusal behavior, which is key for both enhancing model safety and auditing vulnerability to jailbreaking.

Multi-Token Steering Interpretability Framework

The authors introduce a generalizable multi-token activation patching and circuit discovery methodology that extends classical mechanistic interpretability techniques to the context of representation steering in autoregressive, multi-token generation. Unlike prior single-token or final-layer analyses, their approach adapts edge attribution patching with integrated gradients (EAP-IG) to steered and unsteered model completions, providing indirect effect (IE) estimates for every computational edge in the model. This enables circuit discovery at the level of subgraphs responsible for steering-induced behavior change.
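To make the core operation concrete, here is a minimal toy sketch of activation patching and indirect-effect (IE) scoring. This is plain direct patching on a synthetic two-layer model, not the paper's EAP-IG approximation; the model, `metric`, and the steering vector are all illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer "model": hidden = x @ W1; logits = relu(hidden) @ W2.
# W1, W2, and the metric are illustrative, not the paper's setup.
W1 = rng.normal(size=(8, 8))
W2 = rng.normal(size=(8, 2))

def forward(x, patched_hidden=None, patch_idx=None):
    """Run the toy model, optionally overwriting selected hidden units
    with activations cached from another run."""
    h = x @ W1
    if patched_hidden is not None:
        h = h.copy()
        h[patch_idx] = patched_hidden[patch_idx]
    return np.maximum(h, 0) @ W2

def metric(logits):
    # Logit difference between a "refuse" and a "comply" output, a
    # stand-in for the paper's logit-difference metric.
    return logits[0] - logits[1]

x_clean = rng.normal(size=8)
steer = rng.normal(size=8)      # stand-in steering vector
x_steered = x_clean + steer

h_steered = x_steered @ W1      # cache the steered run's activations

# Indirect effect of each hidden unit: patch the steered activation into
# the clean run and measure how far the metric moves.
m_clean = metric(forward(x_clean))
ie = np.array([
    metric(forward(x_clean, patched_hidden=h_steered, patch_idx=i)) - m_clean
    for i in range(8)
])
top_unit = int(np.argmax(np.abs(ie)))  # most causally important unit
```

Ranking components by |IE| and greedily keeping the highest-scoring edges until faithfulness is reached is, in spirit, how the circuit subgraphs are constructed.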

They demonstrate that only 10–11% of model edges downstream from the injection layer are required to recover 85% of the steering effect in Gemma 2 2B and Llama 3.2 3B, indicating that steering acts through highly localized, targeted subcircuits.

Cross-Method Circuit Convergence

Multiple steering vector acquisition methods are compared: the nonparametric Difference-in-Means (DIM) direction, Next-Token Prediction (NTP)-trained, and contrastive Preference Optimization (PO)-trained vectors. Despite moderate pairwise vector cosine similarities (0.10–0.42), the induced circuits for all three classes, when matched at the same injection layer, exhibit high overlap and are functionally interchangeable. Circuits constructed using one methodology remain highly faithful when evaluated with the steered generations from another, demonstrating functional universality of the underlying mechanistic pathways for a given concept at a fixed layer.

Mechanistic Insights: Interaction With Attention and MLP Submodules

A major finding is the centrality of the OV (output-value) component of the attention mechanism as the main propagation route for steering effects. Activation patching and edge distribution analysis reveal that top-causative edges overwhelmingly target attention value projections and, to a lesser extent, MLP modules and the LM head, while attention query and key pathways are nearly absent. Significantly, ablating (freezing) all attention QK (query-key) activations at all layers downstream has negligible impact on steering efficacy (~8.75% drop in ASR), contrasting with substantial performance drops (≥44.5%) when ablating OV paths, steering value vectors (SVVs), or direct MLP effects. This empirically refutes hypotheses that steering interacts with the computation of attention weights and supports a primarily value-centric mediation.

Through a residual stream decomposition, the direct effect of the steering vector on the output of attention heads is formulated and interpreted via logit lens projections. This analysis surfaces direct alignment between the most causally impactful dimensions and semantically interpretable refusal-related or harmful concepts—even when the raw steering vector itself is not interpretable in vocabulary space. Certain SVV directions are strongly aligned across acquisition methodologies; others diverge, likely due to superposition and redundancy in representation.
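The logit lens step can be illustrated with a deliberately deterministic toy: project a head-level steering value vector through an unembedding matrix and read off the tokens it most promotes. The vocabulary, unembedding matrix `W_U`, and the vector itself are all hand-constructed stand-ins, not values from the paper's models.

```python
import numpy as np

# Toy vocabulary and unembedding matrix (stand-ins for a real model's
# vocab and W_U). Columns are built by hand so the demo is deterministic:
# two "refusal concept" tokens share the first residual direction.
vocab = ["the", "illegal", "cat", "dangerous", "hello"]
W_U = np.zeros((4, len(vocab)))
W_U[:, vocab.index("illegal")] = [5.0, 0.0, 0.0, 0.0]
W_U[:, vocab.index("dangerous")] = [4.0, 0.0, 0.0, 0.0]
W_U[:, vocab.index("the")] = [0.0, 1.0, 0.0, 0.0]
W_U[:, vocab.index("cat")] = [0.0, 0.0, 1.0, 0.0]
W_U[:, vocab.index("hello")] = [0.0, 0.0, 0.0, 1.0]

def logit_lens_top_tokens(vec, k=2):
    """Project a residual-stream vector through the unembedding and
    return the k tokens whose logits it most increases."""
    logits = vec @ W_U
    top = np.argsort(logits)[::-1][:k]
    return [vocab[i] for i in top]

# A head's "steering value vector": mostly aligned with the refusal
# direction, plus a small unrelated component.
svv = np.array([1.0, 0.2, 0.0, 0.0])
top_tokens = logit_lens_top_tokens(svv)  # -> ["illegal", "dangerous"]
```

In the paper's analysis the same projection surfaces refusal-related tokens from SVVs even when the raw steering vector projects to nothing interpretable.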

Steering Vector Sparsification and Dimensionality Analysis

Building on the dimension-level importance scores from activation patching, the study introduces gradient-based and indirect effect-based sparsification for steering vectors. These methods identify a highly sparse subset (up to 90–99% zeros) of steering vector dimensions that are sufficient for retaining nearly all steering performance, as evaluated by Attack Success Rate (ASR) across benign and adversarial datasets. The intersection-over-union (IoU) between sparse dimensional masks across DIM, NTP, and PO vectors is significantly above chance, indicating a shared small subspace within embedding space that encodes refusal steering most efficiently.

Random and magnitude-based sparsification baselines degrade more rapidly, supporting the claim that attribution-derived selection recovers functionally critical features with high fidelity. This finding has direct implications for designing targeted, interpretable, and robust steering interventions.
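A minimal sketch of the attribution-based sparsification and the mask-agreement measure, using random stand-ins for the steering vector and for the gradient/IE importance scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Keep only the k dimensions with the largest |importance score| and zero
# the rest. `scores` stands in for gradient- or IE-based attributions.
d_model = 3072
steer = rng.normal(size=d_model)
scores = rng.normal(size=d_model)

def sparsify(vec, scores, keep):
    """Zero all but the `keep` highest-|score| dimensions."""
    mask = np.zeros_like(vec)
    top = np.argsort(np.abs(scores))[::-1][:keep]
    mask[top] = 1.0
    return vec * mask

sparse = sparsify(steer, scores, keep=31)  # ~99% zeros at d_model=3072

def mask_iou(a, b):
    """Intersection-over-union of two vectors' nonzero supports, as used
    to compare which dimensions different methods agree on."""
    a, b = a != 0, b != 0
    return (a & b).sum() / (a | b).sum()
```

Swapping `scores` for `np.abs(steer)` gives the magnitude baseline; a random permutation gives the random baseline. Both degrade faster than attribution-derived selection in the paper's experiments.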

Implications and Future Directions

This case study establishes that regardless of steering vector construction methodology, the actionable mechanisms of refusal steering in LLMs are realized via a shared, highly redundant, and sparse subnetwork, with critical reliance on the OV attention and select MLP pathways. These results support future directions including:

  • More robust and interpretable alignment strategies constructed from a principled understanding of causal pathways
  • Improved model auditing and defense against jailbreaking by targeting the exposed shared pathways
  • Extension to other aligned behaviors and model classes (e.g., different layers, different concepts beyond refusal)
  • Exploration of MLP-specific pathways, which, although secondary, also contribute nontrivially to the steering effect

Mechanistically-informed sparsification provides efficiency and interpretability advantages for steering-based alignment.

Conclusion

This study provides a comprehensive mechanistic dissection of how representation steering for refusal manipulates the internal structure and dynamics of LLMs. Through rigorous causal circuit discovery, inter-methodology comparisons, and sparse decomposition, it reveals that steering vectors leverage a universal, highly localized OV-attention-centered circuit with a minuscule subset of significant vector dimensions. These insights refine our theoretical understanding of LLM controllability and have practical implications for robust, interpretable alignment and model safety strategies. The introduced tools and analytical framework have broader applicability for interpretability and editing throughout the model's residual stream, setting the stage for future concept-level alignment research in LLMs.

Explain it Like I'm 14

What this paper is about (in plain language)

This paper studies a simple “nudge” trick called representation steering for LLMs. A tiny vector (a short list of numbers) is added to the model’s hidden state while it’s generating text. That little nudge can push the model to be more likely to refuse unsafe requests or, if flipped, to stop refusing. The big question the authors ask is: what’s going on inside the model when this works?

They focus on one important behavior: refusal. That means the model saying “no” to harmful instructions. They look under the hood to find which parts of the model carry this steering signal and why it changes the output.


The main questions they ask

  • Which internal “wires” and components in an LLM carry the steering signal that makes it refuse or comply?
  • Do different ways of building steering vectors use the same internal pathways?
  • In the attention mechanism, is steering changing “where the model looks” (QK) or “what information it passes along” (OV)?
  • Can we break the steering vector into meaningful pieces we can understand?
  • Is most of the steering power concentrated in a few important “dimensions” that we can keep while dropping the rest?

How they studied it (explained simply)

Think of an LLM like a very complicated machine with lots of layers and connections:

  • A “steering vector” is like a tiny magnet that gently pulls the machine’s internal state in a certain direction each time it generates a word.
  • “Attention” is the part that helps the model decide which earlier words to focus on. It has two main parts:
    • QK (queries and keys): decides where to look (how much attention to pay).
    • OV (output values): what information to actually pass forward.

To see which parts matter, the authors use tools and tests:

  • Multi-token steering: They add the steering vector at every step of generation, not just once.
  • Activation patching: Imagine pausing the model mid-thought, swapping in certain internal signals from a steered run into an unsteered run (or vice versa), and seeing if the output changes. This shows which connections are truly causing the behavior.
  • Circuits: After patching, they identify a smaller sub-network (a “circuit”) of important connections that’s enough to reproduce the steered behavior.
  • Freezing tests: They “freeze” parts of attention to see what breaks steering. For example, they lock QK (where the model looks) to its original values and ask, “Does steering still work?”
  • Decomposition: They derive a “steering value vector” (SVV) for attention heads—essentially, the part of the nudge that directly affects the OV pathway. Then they use a simple interpretability tool (logit lens) to see which words these vectors prefer (e.g., words like “illegal,” “dangerous”).
  • Sparsification: They try removing most of the steering vector’s dimensions (set many numbers to zero) using gradient-based scores to keep only the most important ones—and test how well steering still works.
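The first idea above, multi-token steering, can be sketched with a toy autoregressive model: the same vector is added to the hidden state at every decoding step, not just once. Everything here (the embedding, recurrence, and unembedding matrices) is an illustrative stand-in, not a real transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy autoregressive "LM": embeds the last token, mixes it into a hidden
# state, and greedily emits the argmax token.
vocab_size, d = 6, 4
E = rng.normal(size=(vocab_size, d))   # embedding
W = rng.normal(size=(d, d))            # recurrence
U = rng.normal(size=(d, vocab_size))   # unembedding

def generate(start_token, steps, steer=None):
    """Greedy decode; if `steer` is given, add it to the hidden state at
    EVERY step (multi-token steering), not just the first."""
    tok, h = start_token, np.zeros(d)
    out = []
    for _ in range(steps):
        h = np.tanh(h @ W + E[tok])
        if steer is not None:
            h = h + steer              # activation addition each step
        tok = int(np.argmax(h @ U))
        out.append(tok)
    return out

base = generate(0, 5)
steered = generate(0, 5, steer=2.0 * rng.normal(size=d))
```

In a real LLM the same effect is achieved with a forward hook on the residual stream at the chosen layer, applied at each generated token.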

They test all of this on two open models (Gemma 2 2B and Llama 3.2 3B) and three steering-vector methods:

  • DIM (Difference-in-Means): simple average-difference between “refusing” and “not refusing” activations.
  • NTP (Next-Token Prediction): a learned vector optimized by standard language modeling.
  • PO (Preference Optimization): a learned vector optimized to prefer desired responses.
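The simplest of the three, DIM, is just a difference of averages. A minimal sketch with synthetic activations standing in for the residual stream at the chosen layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Difference-in-Means (DIM) direction: mean activation on prompts where
# the model refuses minus mean activation where it complies. The
# activations here are synthetic stand-ins.
d_model = 16
refuse_acts = rng.normal(loc=1.0, size=(50, d_model))  # "refusing" runs
comply_acts = rng.normal(loc=0.0, size=(50, d_model))  # "complying" runs

dim_vector = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)

# Adding +c * dim_vector nudges toward refusal; subtracting nudges away.
# A unit-norm version lets one coefficient control the strength.
dim_unit = dim_vector / np.linalg.norm(dim_vector)
```

NTP and PO instead treat the vector as a learned parameter, optimized against a language-modeling or preference objective respectively.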

What they found and why it matters

Here are the key results in short, with why they’re important:

  • Different steering methods use the same internal pathways.
    • Finding: Circuits from DIM, NTP, and PO have very high overlap and are functionally interchangeable when applied at the same layer.
    • Why it matters: No matter how you compute the steering vector, the model tends to route the “refusal nudge” through the same subnetwork. That’s good news for building general tools.
  • Steering changes “what is passed along,” not “where the model looks.”
    • Finding: Steering mostly affects the OV circuit, not the QK circuit. When they froze QK (attention scores) to the unsteered values, steering barely got worse (~8.75% average drop). But disrupting OV or the steering value vectors caused large drops (45–72%).
    • Why it matters: The nudge works by injecting content into the stream of information, rather than changing the model’s focus. This helps target future safety defenses and improvements.
  • A small “circuit” is enough to recreate most of the effect.
    • Finding: About 10–11% of the model’s connections (edges after the steering layer) were sufficient to recover ~85% of the steered behavior (“faithfulness”).
    • Why it matters: The behavior is localized, not spread evenly everywhere—helping researchers focus attention on the most relevant parts.
  • The “steering value vectors” are interpretable even when the raw steering vector isn’t.
    • Finding: Breaking the steering vector into its per-head OV contribution (SVVs) revealed clear, meaningful words (e.g., “illegal,” “dangerous,” “forbidden”) with a simple tool, even when the raw vector looked meaningless.
    • Why it matters: This gives a window into what concepts the steering is injecting.
  • You can drop 90–99% of the steering vector and still do well.
    • Finding: Using gradient-based pruning, they removed most dimensions and kept high performance; on one model, steering still worked with only 9 nonzero dimensions out of 3072.
    • Why it matters: Steering can be made smaller, cheaper, and easier to store or share, while still working. Also, different methods agree on a shared subset of important dimensions.

What this could change going forward

  • Better, more robust alignment tools: Knowing that steering mainly works through OV and that a small circuit carries most of the effect helps engineers design cleaner, more targeted safety nudges with less side-effect on writing quality.
  • More interpretable safety mechanisms: The SVV decomposition makes the injected concepts easier to see and reason about.
  • Leaner steering: Since only a few dimensions and connections matter most, future systems can use sparse, efficient vectors that still work well.
  • Stronger defenses: Understanding how steering vectors can bypass safety via specific pathways may help developers build stronger protections. The authors note this knowledge must be used responsibly.

Note on ethics: The research aims to improve safety by understanding how refusal can be toggled. However, the same knowledge could be misused to defeat safety. The authors argue that benefits outweigh risks, especially because simpler “black-box” attacks already exist, and these insights can motivate better defenses.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper makes strong progress on mechanistic analysis of refusal steering but leaves several concrete issues unresolved. Future work could address the following:

  • Role and mechanisms of MLPs
    • The circuits show substantial MLP involvement, but the paper does not analyze which MLP layers/neurons mediate refusal steering, how they interact with OV paths, or whether specific MLP features can be causally isolated and edited to modulate refusal.
  • Layer dependence of steering mechanisms
    • All analyses are performed at a single “best” steering layer per model. It remains unknown whether circuit structure, OV vs QK reliance, and SVV interpretability hold across earlier/later layers, or how circuit overlap changes when vectors are learned at different layers.
  • Generality beyond the refusal concept
    • Findings are only validated for refusal; it is unclear if similar circuit interchangeability, OV dominance, and sparsifiable subspaces exist for other concepts (e.g., persona, toxicity, helpfulness, reasoning, honesty).
  • Scaling to larger/other architectures
    • Only Gemma 2 2B and Llama 3.2 3B Instruct models are studied. It is unknown whether results persist in larger LLMs, base vs instruction-tuned models, RLHF-heavy models, mixture-of-experts, non-transformer variants, or post-LN architectures.
  • Robustness of “QK is negligible”
    • Freezing QK reduces steering by ~8.75% on average, but the variance across prompts, tasks, layers, models, and decoding regimes is not reported. Cases where QK matters (e.g., long-context retrieval, coreference, multi-turn dialogue) remain uncharacterized.
  • Temporal dynamics of steering in multi-token generation
    • While the method aggregates per-token effects, the paper does not analyze when in the sequence the steering signal is introduced, how it accumulates or decays across tokens, or whether early vs late tokens rely on different subcircuits.
  • Sensitivity to decoding and prompting choices
    • All results use greedy decoding and filtered examples where steering flips behavior. It is unknown how circuits and faithfulness change under sampling, temperature, nucleus sampling, beam search, or different prompt formats/system prompts.
  • Dependence on steering coefficient and schedule
    • The impact of the steering magnitude α and per-token schedules is not studied. It is unclear whether circuit membership, OV/QK reliance, SVV composition, and sparsity profiles vary with different α values or time-varying steering.
  • Selection bias in activation patching datasets
    • Activation patching is conducted only on examples where steering succeeds. Circuit structure and importance may differ on failures or borderline cases; generalization of discovered circuits to the full data distribution is untested.
  • Methodological choices in EAP-IG
    • Faithfulness depends on edge attribution patching with integrated gradients (T=10), logit-difference metrics, greedy circuit construction, and masking choices. Sensitivity to these hyperparameters and to alternative attribution methods (e.g., direct patching, Shapley) is only partially explored.
  • Circuit minimality and uniqueness
    • Circuits are built via a greedy algorithm and evaluated at a fixed faithfulness threshold (≥0.85). Whether smaller or alternative circuits can achieve similar faithfulness, and the extent of circuit non-uniqueness, remains open.
  • SVV decomposition assumptions and validation
    • The steering value vector (SVV) derivation assumes pre-LN RMSNorm transformers and yields input-invariant projections with the logit lens. It is unknown whether SVV semantics persist across architectures, whether SVV directions cause the predicted token-level effects under causal ablations, and how much of OV influence is mediated by SVVs versus indirect pathways.
  • Head-level heterogeneity and antagonistic heads
    • Some heads show interpretable SVVs across methods while others diverge or appear to “remove” concepts (negative IE). The prevalence, causes (e.g., superposition), and editability of such antagonistic heads are not quantified.
  • From edge-type counts to causal head/MLP maps
    • The analysis reports counts of edges landing on V/MLP/LM-head but does not map specific head-to-head or MLP-to-head pathways. A finer-grained causal graph of which heads/MLPs transmit refusal would enable targeted interventions.
  • Sparsification stability and side effects
    • Gradient/IE-based sparsification retains performance up to 90–99% sparsity, but stability across prompts, datasets, decoding settings, and steering magnitudes is not assessed. Impacts on fluency, helpfulness, coherence, and non-target tasks are not measured.
  • Sparse subspace consistency and semantics
    • Intersection-over-Union shows shared dimensions across methods, but whether these dimensions correspond to interpretable features (e.g., SAE features) or stable subspaces across datasets/models remains unverified.
  • Generalization to unseen adversaries and broader safety metrics
    • StrongReject is tested only on one model; broader adversarial benchmarks, human evaluation, and multi-dimensional safety metrics (false refusals, harm categories, jailbreak taxonomies) are not systematically evaluated.
  • Pre-steering layers and upstream influences
    • Circuit discovery ignores edges before the steering layer. Potential upstream interactions (e.g., tokenizer embeddings, early-layer MLP shaping of later V pathways) are unexamined.
  • Interaction with training-time alignment
    • How steering-induced circuits relate to alignment features learned during SFT/RLHF (e.g., are the same heads/MLPs used?) is unknown. Whether steering exploits or bypasses safety circuits remains open.
  • Practical defenses and red-teaming implications
    • The work highlights that sparse vectors can drive jailbreaks but does not propose or test defenses (e.g., detecting/neutralizing svv directions, regularizing OV pathways, robustifying QK contributions) or assess defense effectiveness against these mechanisms.
  • Cross-layer circuit interchangeability
    • Circuits are interchangeable across methods when applied at the same layer; interchangeability across different layers (e.g., DIM at L12 vs PO at L18) is not tested.
  • Multi-concept interference
    • It is unknown how multiple simultaneous steering vectors interact mechanistically (e.g., refusal + helpfulness), whether their circuits overlap/conflict, and how to compose or orthogonalize them safely.
  • Robustness to model perturbations
    • The persistence of discovered circuits under weight perturbations, fine-tuning, pruning, or quantization is not evaluated, limiting conclusions about stability in deployment settings.
  • Evaluation pipeline transparency
    • ASR depends on detecting “bypassed refusal,” but the detection procedure (rule-based vs classifier) and its error rates are not detailed. Measurement reliability and its influence on circuit discovery remain uncertain.

Practical Applications

Immediate Applications

Below are actionable, sector-linked applications that can be deployed with existing tooling and white-box access to models.

  • OV-focused runtime steering for safety alignment (software, trust & safety, healthcare, education)
    • What: Implement refusal steering by targeting attention OV pathways, using sparse vectors that retain performance at 90–99% sparsity, and deprioritize QK-path interventions given their ~8.75% impact.
    • Tools/Products/Workflows: OV-centric steering module in the inference stack; config-driven “refusal guardrail” vectors; gradient/IE-based sparsification; layer-specific hooks; runtime toggles for positive/negative refusal steering.
    • Assumptions/Dependencies: White-box access to residual stream at designated layers; validated layers per model (e.g., Gemma 2 2B L15, Llama 3.2 3B L12); pre-layernorm architectures with RMSNorm; careful evaluation of generation quality trade-offs.
  • Mechanistic auditing pipeline for jailbreak susceptibility (industry R&D, academia, policy compliance)
    • What: Use multi-token activation patching (EAP-IG) and circuit faithfulness (≥85% recovered with ~10–11% edges) to locate and stress-test refusal circuits post-deployment.
    • Tools/Products/Workflows: “Circuit-faithfulness evaluator” in CI; patching datasets curated from benign/harmful prompts; greedy graph construction; dashboard of node/edge importance and complements.
    • Assumptions/Dependencies: Access to activations; reproducible steered vs base generations; compute budget for integrated gradients (T≈10); benchmarks (JailbreakBench, Alpaca) may not fully reflect in-the-wild distributions.
  • Steering Value Vector (SVV) telemetry for real-time safety monitoring (software trust & safety, policy)
    • What: Decompose steering effects into head-specific SVVs and project via logit lens to track interpretable harmful/refusal concepts at runtime, even when raw vectors are not interpretable.
    • Tools/Products/Workflows: “OV Guard” monitor aggregating SVV token distributions across heads; alerts on suppression of refusal concepts; per-head sign checks to catch concept removal.
    • Assumptions/Dependencies: Stable mapping from SVVs to token concepts across updates; false positives tuned via thresholds; model-family-specific head maps.
  • Cross-method steering interoperability and standardized layer hooks (software platforms, MLOps)
    • What: Swap DIM/NTP/PO vectors across shared circuits at the same layer (high overlap/interchangeability) to standardize deployment while maintaining effect.
    • Tools/Products/Workflows: “Steering slots” API in model servers; registry of vetted layer-specific vectors; automated cross-method validation harness.
    • Assumptions/Dependencies: Same-layer application; moderate cosine similarity acceptable; consistency across minor model variants must be empirically checked.
  • Precision reduction of false refusals and side effects via dimension-level control (product UX, education, healthcare)
    • What: Use gradient/IE-based sparsification to retain refusal effect while reducing collateral refusals on harmless prompts by pruning dimensions that overexpress harmful/refusal tokens.
    • Tools/Products/Workflows: Dimension ranking and pruning dashboards; A/B testing on benign-task sets; “precision refusal” presets per domain (e.g., medical disclaimers vs general safety).
    • Assumptions/Dependencies: Requires domain-specific evaluation; sparse vectors may still interact with MLP nodes—monitor generation quality; potential superposition effects across heads.
  • Defensive red-teaming acceleration (industry trust & safety teams, auditors)
    • What: Emulate white-box steering attacks to probe resilience of alignment, focusing on OV vulnerabilities and sparse vectors.
    • Tools/Products/Workflows: Internal red-team harness using activation patching; OV-ablation and SVV-only ablation to quantify sensitivity; reporting into risk registries.
    • Assumptions/Dependencies: Strict governance to prevent misuse; ethics review; scope-limited access to model internals.
  • On-device alignment for consumer applications via sparse vectors (mobile/software, daily life)
    • What: Reduce compute/latency overhead by applying highly sparse refusal vectors on-device for parental controls or safe-mode.
    • Tools/Products/Workflows: Lightweight steering kernels; precompiled sparse masks; toggles in consumer apps (kid-friendly mode).
    • Assumptions/Dependencies: Device hardware supports layer hooks; privacy constraints; careful UX to avoid user-driven circumvention.

Long-Term Applications

The following opportunities require further research, scaling, standardization, or productization beyond current prototypes.

  • Architecture-level defenses against OV manipulation (model development, software)
    • What: Design models and training procedures that distribute refusal features across OV and QK/MLP paths, or introduce “OV-hardening” layers and detectors to reduce steerability of harmful concepts.
    • Tools/Products/Workflows: Training-time regularizers; adversarial feature training (e.g., refusal feature adversarial training); runtime OV anomaly detectors.
    • Assumptions/Dependencies: Validation at larger scales and diverse model families; avoiding performance regressions; compatibility with attention variants.
  • Alignment auditing and certification standards (policy, compliance, procurement)
    • What: Establish regulatory guidance that requires circuit-faithfulness tests, OV/QK sensitivity audits, and SVV interpretability reports for safety-aligned models.
    • Tools/Products/Workflows: Standardized audit protocols; third-party certifications; reporting templates (circuit overlap, complement faithfulness, sparsity resilience).
    • Assumptions/Dependencies: Multi-stakeholder consensus; access to model internals for auditors; harmonized benchmarks beyond specific datasets.
  • Generalized concept steering toolkit beyond refusal (software, education, healthcare)
    • What: Apply multi-token patching and SVV decomposition to other residual-stream vectors (e.g., sparse autoencoder features, editing vectors) for controllable style/persona, domain-compliance, and content policy enforcement.
    • Tools/Products/Workflows: “Steering Vector Studio” supporting concept discovery, SVV analysis, sparsification, and deployment; domain presets (medical, legal, educational).
    • Assumptions/Dependencies: Concept-specific validation; superposition/entanglement management; cross-layer generalization.
  • Adaptive, context-aware guardrails using SVV dynamics (consumer apps, daily life, education)
    • What: Real-time adjustment of refusal thresholds based on SVV signals and context (task, user profile, jurisdiction), minimizing unnecessary refusals while blocking genuinely harmful outputs.
    • Tools/Products/Workflows: Policy engines that read SVV telemetry; programmable guardrails; feedback loops to tune per-context dimensions.
    • Assumptions/Dependencies: Robust detection with low latency; privacy- and fairness-aware policy logic; guardrail quality measurement.
  • Vendor-agnostic “alignment API” and interoperable steering slots (software ecosystem)
    • What: Define a cross-model API for layer-level hooks, vector formats, and circuit reporting so safety modules can plug into different LLM families with minimal rework.
    • Tools/Products/Workflows: Open specification; adapters for major open models; conformance tests for circuit interchangeability and faithfulness.
    • Assumptions/Dependencies: Cooperation from model providers; abstracting architectural differences; security controls against misuse.
  • Edge and robotics safety dialog control (robotics, IoT, automotive)
    • What: Deploy sparse refusal steering to ensure safety-compliant interactions in devices with constrained compute, e.g., service robots or in-car assistants.
    • Tools/Products/Workflows: Embedded inference with sparse masks; safety policy bundles; periodic OV integrity checks.
    • Assumptions/Dependencies: Reliable on-device hooks; rigorous safety validation; resilience to distribution shifts in user interactions.
  • Training methods robust to sparse and OV-centric jailbreaks (academia, industry)
    • What: Develop curricula and training strategies (e.g., feature-level adversarial training, preference optimization with OV-aware constraints) that reduce steerability of harmful features and improve refusal robustness.
    • Tools/Products/Workflows: Feature-level adversarial datasets; OV/QK/MLP targeted losses; post-hoc mechanistic evaluations integrated into training loops.
    • Assumptions/Dependencies: Compute cost; scalability to larger models; avoiding over-regularization that harms capabilities.
  • Sector-specific compliance workflows (finance, healthcare, education)
    • What: Tailor steering and SVV monitors to enforce domain policies (e.g., medical disclaimers, financial risk warnings, age-appropriate content) with low overhead and clearer audit trails.
    • Tools/Products/Workflows: Domain SVV lexicons; compliance dashboards; periodic circuit audits tied to policy updates.
    • Assumptions/Dependencies: Accurate domain concept mapping; regulatory alignment; monitoring for unintended bias or over-refusal.

Notes on Assumptions and Dependencies (cross-cutting)

  • White-box access: Most methods require activation-level hooks and weight access; closed models limit applicability.
  • Layer specificity: Results are validated at particular middle layers; portability across layers and model sizes needs testing.
  • Model scope: Evidence is from Gemma 2 2B and Llama 3.2 3B; larger or different architectures may exhibit different circuit dynamics.
  • Safety and ethics: Defensive framing and access controls are essential to prevent misuse of steering knowledge for jailbreaking.
  • Evaluation breadth: Benchmarks (JailbreakBench, Alpaca, StrongReject) cover important but limited adversarial scenarios; in-the-wild generalization requires broader testing.
  • Generation quality: Steering and sparsification can affect fluency and task performance; monitor for degradation and false refusals.

Glossary

  • Activation addition: A steering technique that adds a fixed vector to hidden activations to influence model behavior at inference. "activation addition steering (Turner et al., 2023) is formulated as"
  • Activation patching: An intervention method that replaces selected activations to test their causal role in a behavior. "Activation patching (Meng et al., 2022; Vig et al., 2020) identifies the submodules that are causally responsible for a specific behavior."
  • Attack Success Rate (ASR): Evaluation metric measuring the fraction of generations that bypass a model’s refusal. "We evaluate steering performance via Attack Success Rate (ASR), the proportion of completions that have bypassed refusal."
  • Circuit: An end-to-end subgraph of the computational graph that implements a specific behavior. "a circuit C (Wang et al., 2023) is an end-to-end subgraph of M that is responsible for a specific model behavior."
  • Edge attribution patching with integrated gradients (EAP-IG): A scalable approximation to activation patching that attributes importance to edges using integrated gradients. "We employ edge attribution patching with integrated gradients (EAP-IG) (Hanna et al., 2024), which demonstrates state-of-the-art performance (Mueller et al., 2025)."
  • Faithfulness: A measure of how well a discovered circuit reproduces the model’s behavior when other components are ablated. "We use the faithfulness metric (Marks et al., 2025; Wang et al., 2023), defined as (m(C) − m(∅)) / (m(M) − m(∅))"
  • Indirect effect (IE): A causal quantification of an edge’s importance by intervening on the computational graph. "the importance of (u, v) is quantified through its indirect effect (Pearl, 2013):"
  • Intersection over Union (IoU): A set-similarity metric used here to compare selected dimension subsets across methods. "we compute the Intersection over Union (IoU) of the nonzero dimensions"
  • Jailbreaking: Techniques that circumvent an aligned model’s safety training to elicit disallowed outputs. "refusal within the context of LLM jailbreaking (Wei et al., 2023)."
  • Logit difference: An importance metric based on the difference between target and baseline logits for clean vs. corrupt predictions. "We use logit difference (Zhang and Nanda, 2024) as our importance metric m"
  • Logit lens: A linear probe that projects internal vectors into logits over the vocabulary to interpret their semantics. "use logit lens (nostalgebraist, 2020) to project their svvs to the output vocabulary."
  • Mechanistic interpretability: The study of how specific internal components and computations produce behaviors. "We propose to extend traditional mechanistic interpretability techniques, typically applied only to standard LLM inference runs, to steered inference runs,"
  • Multi-head attention (MHA): The attention module consisting of multiple parallel attention heads. "The residual stream of a pre-layernorm transformer LLM is the sum of each layer's MLP and multi-head attention (MHA) outputs."
  • Multi-token steering: Applying a steering vector at multiple decoding steps across a generated sequence. "We study multi-token steering (Chen et al., 2025; Wu et al., 2025a), where the steering vector is repeatedly added to each decoded token."
  • Next Token Prediction (NTP): A learning objective used to train steering vectors via standard language modeling. "NTP uses the language modeling objective to learn a steering vector on prompt-response pairs that express the desired concept;"
  • OV circuit: The attention pathway involving output (O) aggregation and value (V) projections, implicated as the main mediator of steering. "Refusal steering interacts with attention primarily through the OV circuit."
  • Preference Optimization (PO): A method for learning steering vectors using contrastive pairs differing only by concept expression. "PO uses contrastive responses that differ only by concept expression."
  • QK circuit: The attention pathway defined by query-key interactions that set attention scores. "freezing all attention scores (QK circuit) drops performance by only 8.75%."
  • Residual stream: The running sum of layer outputs that carries information forward through the transformer. "The residual stream of a pre-layernorm transformer LLM is the sum of each layer's MLP and multi-head attention (MHA) outputs."
  • RMSNorm: A normalization layer that scales activations based on their root-mean-square magnitude. "γ ∈ R^d be the element-wise weights of the RMSNorm,"
  • Steering value vector (SVV): The head-specific vector capturing the direct effect of a steering vector on an attention head’s value output. "svv_h(s) = (s ⊙ γ) W_h^{OV} ∈ R^d is the steering value vector of head h."
  • Steering value vector decomposition: A decomposition of steering effects into attention-head-specific value vectors that can be semantically interpreted. "We introduce the steering value vector decomposition,"
  • Steering vector: A direction in activation space added during inference to control a concept (e.g., refusal). "Applying steering vectors to LLMs is an efficient and effective model alignment technique,"
  • Superposition: The phenomenon where multiple features are encoded in overlapping directions in representation space. "possibly due to superposition (Elhage et al., 2022)."
  • Teacher forcing: A decoding technique that feeds the model the ground-truth tokens to condition subsequent steps for analysis. "We sequentially patch on each decoded token position by teacher forcing on the response."
  • Value projection: The linear transformation that maps residual activations to attention value vectors. "we subtract layer-normalized s from the input to the value projection at each layer during steered generation"
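Two of the glossary entries above, the logit lens and the steering value vector, combine naturally: the paper projects SVVs to the output vocabulary to read off their semantics. The following is a minimal, self-contained sketch of that projection using a toy vocabulary and random matrices; `W_U`, `tokens`, and `logit_lens_topk` are illustrative stand-ins, not the authors' implementation, and a real model would supply its own unembedding matrix and head-specific SVVs.

```python
import numpy as np

# Toy setup: in a real model, W_U is the unembedding matrix (d_model x vocab)
# and `svv` would come from the steering value vector decomposition of a head.
rng = np.random.default_rng(1)
d_model = 64
tokens = ["I", "cannot", "sorry", "sure", "help", "refuse", "the", "a"]
W_U = rng.normal(size=(d_model, len(tokens)))

def logit_lens_topk(vec, W_U, tokens, k=3):
    """Project an internal vector to vocabulary logits and return the
    k highest-scoring tokens -- a rough read of the vector's semantics."""
    logits = vec @ W_U
    top = np.argsort(logits)[::-1][:k]
    return [tokens[i] for i in top]

# A vector aligned with the "refuse" unembedding direction should surface
# "refuse" as its top token under the logit lens.
svv = W_U[:, tokens.index("refuse")]
print(logit_lens_topk(svv, W_U, tokens))
```

This mirrors the paper's observation that individual SVVs can carry semantically interpretable content (e.g., refusal-related tokens) even when the raw steering vector itself does not, possibly due to superposition.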
