Predicting Where Steering Vectors Succeed

Published 16 Apr 2026 in cs.LG and cs.CL | (2604.15557v1)

Abstract: Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{\mathrm{lin}}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.


Authors (1)

Jayadev Billa

Summary

The paper introduces LAP as a diagnostic method using logit lens evaluations to predict when steering vectors will successfully control language model outputs.
It reveals a crystallization gap in which nonlinear concept recovery precedes linear output alignment, with peak $A_{\mathrm{lin}}$ correlating with steering efficacy at up to $\rho = +0.91$.
The approach yields a 30× computational saving and informs both layer selection and the choice between linear and nonlinear steering methods.

Predicting Steering Vector Success in LLMs

Introduction

The paper "Predicting Where Steering Vectors Succeed" (2604.15557) addresses the central challenge of predicting when and where steering vectors—activation manipulations in the residual stream—are capable of reliably controlling concept-specific behavior in LMs. Steering vectors have been shown to shift attributes such as refusal, truthfulness, or factual outputs, but their effectiveness is erratic across models, layers, and target concepts. Prior practice largely relied on heuristics, particularly intervening in mid-model layers, with no systematic understanding of linear steerability limits. This work proposes the Linear Accessibility Profile (LAP): a diagnostic methodology leveraging logit lens evaluations to predict both if and at which layer difference-of-means steering will succeed.

Linear Accessibility Profile Formulation

LAP is constructed by applying the model’s own unembedding matrix to intermediate-layer activations, following the logit lens technique. At each layer, the unembedding projects the hidden state to vocabulary logits, and the single-token accuracy of that projection is recorded as $A_{\mathrm{lin}}$. This quantifies the degree to which a concept is output-aligned and hence steerable by linear activation addition. It is supplemented with two auxiliary diagnostics:

Probe Gap: A residual MLP is trained per layer to measure nonlinear recoverability, identifying where information is present nonlinearly but not linearly accessible via the unembedding.
Perturbation Sensitivity ($\lambda$): Evaluates local instability in the representation, identifying layers where interventions have unreliable effects due to chaotic post-layer processing.

The LAP thus consists of per-layer $(A_{\mathrm{lin}},\,A_{\mathrm{mlp}},\,\lambda)$ triplets, where $A_{\mathrm{lin}}$ is the logit lens accuracy, $A_{\mathrm{mlp}}$ is nonlinear probe accuracy, and $\lambda$ measures perturbation amplification.
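
The $A_{\mathrm{lin}}$ component described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the hidden states are assumed to have been extracted already, and any final normalization layer that a real logit lens would apply before the unembedding is omitted.

```python
import numpy as np

def logit_lens_accuracy(hidden, unembed, target_ids):
    """Fraction of examples whose top-1 logit-lens token equals the target.

    hidden:     (n_examples, d_model) hidden states taken at one layer
    unembed:    (d_model, vocab_size) unembedding matrix
    target_ids: (n_examples,) correct single-token id for each example
    """
    logits = hidden @ unembed              # project straight to vocabulary space
    predictions = logits.argmax(axis=-1)   # top-1 token per example
    return float((predictions == target_ids).mean())

def linear_accessibility_profile(hidden_by_layer, unembed, target_ids):
    """Per-layer A_lin values, one entry per layer in hidden_by_layer."""
    return [logit_lens_accuracy(h, unembed, target_ids) for h in hidden_by_layer]
```

Because the only learned parameter used is the model's existing unembedding matrix, nothing is trained; the cost beyond the forward pass is one matrix product per layer examined.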

Empirical Findings

Layerwise Accessibility and Crystallization

Experiments across diverse LMs (Gemma-2-2B, Llama-8B, Qwen-7B, Mistral-7B, and non-transformer architectures) demonstrate that concepts are nonlinearly recoverable much earlier in the network than they are linearly accessible via output projection. In all evaluated models, $A_{\mathrm{lin}}$ remains near zero for the first 70–80% of layers and then emerges sharply in the late layers. For example, in Gemma-2-2B, sequence prediction reaches nonlinear probe accuracy $>0.9$ at layer 5, but $A_{\mathrm{lin}}$ remains 0 until layer 18. This "crystallization gap" highlights how brittle the alignment is between model-internal representations and the output head.
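
One way to quantify the gap from two per-layer profiles is sketched below. The helper names and both thresholds are assumptions made for illustration; the source does not state how the onset layers are defined.

```python
def first_layer_above(profile, threshold):
    """Index of the first layer whose accuracy exceeds threshold, or None."""
    for layer, accuracy in enumerate(profile):
        if accuracy > threshold:
            return layer
    return None

def crystallization_gap(a_mlp, a_lin, mlp_threshold=0.9, lin_threshold=0.0):
    """Layers between nonlinear recoverability and linear accessibility.

    a_mlp, a_lin: per-layer accuracy lists of equal length.
    Returns None if either profile never exceeds its threshold.
    """
    nonlinear_onset = first_layer_above(a_mlp, mlp_threshold)
    linear_onset = first_layer_above(a_lin, lin_threshold)
    if nonlinear_onset is None or linear_onset is None:
        return None
    return linear_onset - nonlinear_onset
```

With onsets at layers 5 and 18, as in the Gemma-2-2B example, this definition gives a gap of 13 layers.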

Predicting Steering Efficacy

A key result is that peak $A_{\mathrm{lin}}$ strongly predicts both (i) which layer yields maximal steering efficacy and (ii) which concepts are steerable at all via difference-of-means directions. Across 24 controlled binary concept families and five models, the correlation between peak $A_{\mathrm{lin}}$ and maximum achievable steering is $\rho = +0.86$ to $+0.91$, substantially higher than the correlations for trained probe accuracy or layer depth alone.
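
For readers who want to run the same kind of check on their own measurements, the sketch below computes a rank correlation between per-concept peak $A_{\mathrm{lin}}$ and a per-concept steering score. It assumes the reported $\rho$ is a Spearman coefficient, which this summary does not confirm, and the function names are illustrative. In practice `scipy.stats.spearmanr` does the same job.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start..end, 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold one peak $A_{\mathrm{lin}}$ value per concept family and `y` the corresponding measured steering effect.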

Layer selection using peak $A_{\mathrm{lin}}$ matches the empirical steering optimum in 4/5 core families and in all tested architectures, indicating that "middle-layer steering" is only effective when the concept is both present and output-aligned at that layer. Notably, steering at layers that generic probes identify as high-accuracy often produces zero change if $A_{\mathrm{lin}}$ is low there, since separability at those layers does not translate to output alignment.
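
The pieces involved in this procedure can be sketched as follows. This is a simplified illustration under stated assumptions: activations are plain NumPy arrays collected at a single layer, `alpha` is a hypothetical scaling knob, and the function names are not taken from the paper's codebase.

```python
import numpy as np

def difference_of_means(pos_hidden, neg_hidden):
    """Steering direction: mean positive activation minus mean negative one.

    pos_hidden, neg_hidden: (n_examples, d_model) activations at one layer.
    """
    return pos_hidden.mean(axis=0) - neg_hidden.mean(axis=0)

def steer(hidden, direction, alpha=1.0):
    """Add the scaled steering direction to every hidden state."""
    return hidden + alpha * direction

def recommended_layer(a_lin_profile):
    """Layer index with peak A_lin (the earliest such layer on ties)."""
    return int(np.argmax(a_lin_profile))
```

In a full pipeline the direction would be computed at the layer returned by `recommended_layer` and added to the residual stream at that same layer during generation.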

Three-Regime Framework

The authors formalize a three-regime taxonomy:

Regime 1: Concept is absent ($A_{\mathrm{mlp}}$ low, $A_{\mathrm{lin}}$ low): no steering method is effective.
Regime 2: Concept is present but nonlinearly encoded ($A_{\mathrm{mlp}}$ high, $A_{\mathrm{lin}}$ low): linear steering fails, but SAE-like nonlinear methods may succeed.
Regime 3: Concept is output-aligned ($A_{\mathrm{mlp}}$ high, $A_{\mathrm{lin}}$ high): linear steering is highly effective.

This diagnostic demarcation guides both where to steer and which method (linear vs. nonlinear) to employ.
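
As a rough sketch, the taxonomy can be written as a small decision function. The single 0.5 cutoff separating "low" from "high" is an assumption chosen for illustration; this summary does not give the thresholds the authors use.

```python
def classify_regime(a_lin, a_mlp, threshold=0.5):
    """Map peak (A_lin, A_mlp) to regime 1, 2, or 3.

    Returns None for high A_lin with low A_mlp, a combination the
    three-regime taxonomy as summarized here does not cover.
    """
    lin_high = a_lin >= threshold
    mlp_high = a_mlp >= threshold
    if not mlp_high and not lin_high:
        return 1  # concept absent: no steering method expected to work
    if mlp_high and not lin_high:
        return 2  # present but nonlinearly encoded: linear steering fails
    if mlp_high and lin_high:
        return 3  # output-aligned: linear steering expected to work
    return None
```

A screening script could call this once per concept and route regime 2 concepts to nonlinear methods while skipping regime 1 concepts entirely.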

Robustness and Scaling Properties

Emergence patterns and LAP predictions generalize across transformer families and non-transformer SSMs/RWKVs. The correlation between $A_{\mathrm{lin}}$ and steerability strengthens with model size, and cross-model LAP values are themselves predictive: LAP computed on Qwen-1.5B predicts steering success on Llama-8B and Qwen-7B, supporting transferability and cost-effective screening. As model size grows, the fraction of steerable concepts and mean $A_{\mathrm{lin}}$ both increase, though with saturation beyond several billion parameters.

Failure Modes and Fine-Grained Analysis

Cases where high LAP does not strictly translate to strong steering are explained either by misalignment between the computed difference-of-means direction and the unembedding projection (low target-token rank) or by high baseline probability mass already assigned to the desired target. The analysis also details "chaotic regime" failures, where high perturbation sensitivity causes interventions to act unpredictably.

Practical and Theoretical Implications

The LAP diagnostic operationalizes an inexpensive, training-free protocol (one forward pass plus unembedding matrix multiplications on the intermediate hidden states) for concept and layer screening, offering a $30\times$ computational saving over brute-force techniques. Its predictive power supports both interpretability and intervention strategies: practitioners can assess the viability of steering for any given concept in advance, rather than by post-hoc trial and error. LAP also informs decisions on method complexity, identifying when sophisticated SAE-based or nonlinear steering is warranted.

Theoretically, the results substantiate the output-alignment principle for causal interventions: effective steering requires concept alignment with the model’s output head. The strong cross-model consistency suggests that output-aligned linear subspaces for given concepts are structurally stable across architectures and scales. The persistent presence of nonlinear-only regimes indicates limitations of the linear representation hypothesis for substantial regions of model depth and motivates further interpretability work on sub-output manifold features.

Future Directions

Results suggest multiple avenues for expansion:

Extending LAP diagnostics to multi-token completions, e.g., generative behaviors and distributional outputs, with preliminary success in refusal and entity-steering tasks.
Systematic integration with SAE/transcoder methods for nonlinear concept extraction in regime 2 and validation of the regime framework for circuit-level mechanistic interpretability.
Deeper exploration of final-layer anomalies and architecture-specific lens mismatches (e.g., RMSNorm vs. LayerNorm distinctions).
Benchmarking broader concept families with the released controlled dataset to stress-test interpretability and intervention tools.

Conclusion

The presented analysis demonstrates that the logit lens, via the LAP diagnostic, robustly predicts where and when steering vectors will be effective in LLMs. This method obviates trial-and-error approaches to activation engineering, identifies necessary alignment for linear interventions, and delineates when nonlinear methods are essential. The approach is computationally efficient, transferable across architectures, scalable, and theoretically principled. The accompanying codebase and controlled benchmarks facilitate reproducibility and adoption for future interpretability research and practical activation engineering.
