- The paper introduces LAP as a diagnostic method using logit lens evaluations to predict when steering vectors will successfully control language model outputs.
- It reveals a crystallization gap where nonlinear concept recovery precedes linear output alignment, with peak A_lin correlating up to +0.91 with steering efficacy.
- The approach yields a 30× computational saving and informs both optimal layer selection and the choice between linear and nonlinear steering methods.
Predicting Steering Vector Success in LLMs
Introduction
The paper "Predicting Where Steering Vectors Succeed" (2604.15557) addresses the central challenge of predicting when and where steering vectors (activation manipulations in the residual stream) can reliably control concept-specific behavior in LMs. Steering vectors have been shown to shift attributes such as refusal, truthfulness, or factual outputs, but their effectiveness is erratic across models, layers, and target concepts. Prior practice largely relied on heuristics, particularly intervening in mid-model layers, with no systematic understanding of the limits of linear steerability. This work proposes the Linear Accessibility Profile (LAP): a diagnostic methodology that uses logit lens evaluations to predict both whether and at which layer difference-of-means steering will succeed.
LAP is constructed by applying the model’s own output unembedding to intermediate layer activations using the logit lens technique. Specifically, the unembedding is applied to the activations at each layer and single-token accuracy is computed on the resulting predictions; this quantifies the degree to which a concept is output-aligned and hence steerable by linear activation addition. This is supplemented with two auxiliary diagnostics:
- Probe Gap: A residual MLP is trained per layer to measure nonlinear recoverability, identifying where information is present nonlinearly but not linearly accessible via the unembedding.
- Perturbation Sensitivity (λ): Evaluates local instability in the representation, identifying layers where interventions have unreliable effects due to chaotic post-layer processing.
The LAP thus consists of per-layer (A_lin, A_mlp, λ) triplets, where A_lin is the logit lens accuracy, A_mlp is the nonlinear probe accuracy, and λ measures perturbation amplification.
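The A_lin component of the profile can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations, not the paper's implementation: the function name, array shapes, and toy data are assumptions, and the final normalization that a logit lens usually applies before the unembedding is omitted for brevity.

```python
import numpy as np

def logit_lens_accuracy(hidden_states, unembed, target_ids):
    """Per-layer top-1 accuracy of the unembedding applied to intermediate activations.

    hidden_states: (n_layers, n_examples, d_model) activations at the answer position
    unembed:       (d_model, vocab_size) output unembedding matrix
    target_ids:    (n_examples,) correct single-token answers
    """
    logits = hidden_states @ unembed            # (n_layers, n_examples, vocab_size)
    preds = logits.argmax(axis=-1)              # (n_layers, n_examples)
    return (preds == target_ids[None, :]).mean(axis=1)

# Synthetic example (assumption): an identity unembedding and one-hot activations,
# so a layer is "correct" exactly when its activation points at the target token.
vocab = d_model = 8
unembed = np.eye(d_model, vocab)
targets = np.array([0, 1, 2, 3])
wrong = np.eye(d_model)[(targets + 1) % vocab]  # every example decodes to a wrong token
right = np.eye(d_model)[targets]                # every example decodes to its target
mixed = np.concatenate([right[:2], wrong[2:]])  # half of the examples are correct
hidden = np.stack([wrong, mixed, right])        # three toy "layers"

a_lin = logit_lens_accuracy(hidden, unembed, targets)
print(a_lin)
```

With a real model, `hidden_states` would come from the per-layer residual stream collected in a single forward pass, and `unembed` would be the model's own output head.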
Empirical Findings
Layerwise Accessibility and Crystallization
Experiments across diverse LMs (Gemma-2-2B, Llama-8B, Qwen-7B, Mistral-7B, and non-transformer architectures) demonstrate that concepts are nonlinearly recoverable much earlier in the network than they are linearly accessible via output projection. In all evaluated models, A_lin remains near zero for the first 70–80% of layers and then emerges sharply in the penultimate layers. For example, in Gemma-2-2B, sequence prediction reaches nonlinear probe accuracy >0.9 at layer 5, but A_lin remains 0 until layer 18. This "crystallization gap" highlights the brittle alignment between model-internal representations and the output head.
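One way to quantify the crystallization gap is to compare the first layer at which each profile crosses an accuracy threshold. The sketch below assumes a threshold of 0.5 and uses synthetic step-shaped profiles; the helper names, the threshold, and the toy values are illustrative assumptions, not measurements from the paper.

```python
import numpy as np

def first_layer_above(profile, threshold):
    """Index of the first layer whose accuracy reaches the threshold, or None."""
    hits = np.flatnonzero(np.asarray(profile) >= threshold)
    return int(hits[0]) if hits.size else None

def crystallization_gap(a_mlp, a_lin, threshold=0.5):
    """Number of layers between nonlinear recoverability and linear accessibility."""
    mlp_layer = first_layer_above(a_mlp, threshold)
    lin_layer = first_layer_above(a_lin, threshold)
    if mlp_layer is None or lin_layer is None:
        return None  # one of the two profiles never crosses the threshold
    return lin_layer - mlp_layer

# Synthetic profiles (assumption): the probe succeeds from layer 5 onward,
# while the logit lens only succeeds from layer 18 onward.
layers = np.arange(26)
a_mlp = np.where(layers >= 5, 0.95, 0.10)
a_lin = np.where(layers >= 18, 0.80, 0.00)

gap = crystallization_gap(a_mlp, a_lin)
print(gap)
```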
Predicting Steering Efficacy
A key result is that peak A_lin strongly predicts both (i) which layer yields maximal steering efficacy and (ii) which concepts are steerable at all via difference-of-means directions. Across 24 controlled binary concept families and five models, the correlation between peak A_lin and maximum achievable steering reaches up to +0.91, significantly outstripping correlations for trained probe accuracy or layer depth alone.
Layer selection using peak A_lin matches the empirical steering optimum in 4/5 core families and in all tested architectures, indicating that "middle-layer steering" is only effective when the concept is both present and output-aligned at that layer. Notably, steering at layers that generic probes flag as high-accuracy often produces zero change if A_lin is low, since probe separability does not translate to output alignment.
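The two ingredients discussed here, a difference-of-means direction and layer selection by peak A_lin, can be sketched as follows. The function names, the scaling factor `alpha`, and the toy arrays are assumptions made for illustration; a real setup would collect activations from a model and add the direction to the residual stream during generation.

```python
import numpy as np

def difference_of_means(pos_acts, neg_acts):
    """Steering direction: mean activation with the concept minus mean without it."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, direction, alpha=1.0):
    """Activation addition: shift every residual vector along the direction."""
    return hidden + alpha * direction

def pick_steering_layer(a_lin):
    """Choose the layer with the highest logit-lens accuracy."""
    return int(np.argmax(a_lin))

# Toy activations (assumption): two examples per class in a 2-d residual stream.
pos_acts = np.array([[2.0, 0.0], [4.0, 0.0]])
neg_acts = np.array([[0.0, 0.0], [0.0, 2.0]])
direction = difference_of_means(pos_acts, neg_acts)

# Toy A_lin profile (assumption): accessibility peaks at layer 3.
best_layer = pick_steering_layer([0.0, 0.0, 0.1, 0.7, 0.4])
steered = steer(np.zeros((1, 2)), direction, alpha=2.0)
print(direction, best_layer, steered)
```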
Three-Regime Framework
The authors formalize a three-regime taxonomy:
- Regime 1: Concept is absent (A_mlp low, A_lin low): no steering method is effective.
- Regime 2: Concept is present but nonlinearly encoded (A_mlp high, A_lin low): linear steering fails, but SAE-like nonlinear methods may succeed.
- Regime 3: Concept is output-aligned (A_mlp high, A_lin high): linear steering is highly effective.
This diagnostic demarcation guides both where to steer and which method (linear vs. nonlinear) to employ.
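The taxonomy above can be expressed as a small decision rule. The sketch below is an assumption-laden illustration: the 0.5 cut-off separating "low" from "high" is hypothetical, and the handling of a high A_lin paired with a low A_mlp (a case the taxonomy does not cover) is a choice made only for this example.

```python
def classify_regime(a_lin, a_mlp, threshold=0.5):
    """Map one layer's (A_lin, A_mlp) pair to a regime number and a suggestion.

    The threshold is a hypothetical cut-off, not a value taken from the paper.
    """
    if a_lin >= threshold:
        # Output-aligned. A high A_lin with a low A_mlp is outside the taxonomy;
        # this sketch treats it as regime 3 as well.
        return 3, "linear steering (difference of means)"
    if a_mlp >= threshold:
        return 2, "nonlinear methods (e.g. SAE-based)"
    return 1, "no steering method expected to work"

# Illustrative (A_lin, A_mlp) pairs, chosen by hand for the example.
examples = {
    "absent": (0.02, 0.10),
    "nonlinear only": (0.05, 0.93),
    "output-aligned": (0.81, 0.95),
}
regimes = {name: classify_regime(a_lin, a_mlp)[0]
           for name, (a_lin, a_mlp) in examples.items()}
print(regimes)
```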
Robustness and Scaling Properties
Emergence patterns and LAP predictions generalize across transformer families and non-transformer architectures (SSMs, RWKV). The correlation between peak A_lin and steerability strengthens with model size, and cross-model LAP values are themselves predictive: LAP computed on Qwen-1.5B predicts steering success on Llama-8B and Qwen-7B, supporting transferability and cost-effective screening. As model size grows, the fraction of steerable concepts and mean A_lin both increase, though with saturation beyond several billion parameters.
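Cross-model screening of this kind amounts to computing peak A_lin per concept on a small proxy model, checking how well it tracks steering outcomes on the larger target, and shortlisting concepts worth steering. The sketch below uses Pearson correlation via `np.corrcoef`; the concept names, the numbers, and the `min_peak` cut-off are placeholders invented for the example, not results from the paper.

```python
import numpy as np

def screening_correlation(proxy_peak_a_lin, target_steering_effect):
    """Pearson correlation between proxy-model peak A_lin and target-model steering."""
    x = np.asarray(proxy_peak_a_lin, dtype=float)
    y = np.asarray(target_steering_effect, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def shortlist(concepts, proxy_peak_a_lin, min_peak=0.5):
    """Keep concepts whose proxy-model peak A_lin clears a (hypothetical) cut-off."""
    return [c for c, a in zip(concepts, proxy_peak_a_lin) if a >= min_peak]

# Placeholder values (assumption): four hypothetical concepts.
concepts = ["concept_a", "concept_b", "concept_c", "concept_d"]
proxy_peak = [0.05, 0.30, 0.70, 0.90]     # peak A_lin measured on the small model
target_effect = [0.0, 0.2, 0.6, 0.8]      # steering effect measured on the large model

r = screening_correlation(proxy_peak, target_effect)
candidates = shortlist(concepts, proxy_peak)
print(round(r, 3), candidates)
```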
Failure Modes and Fine-Grained Analysis
Cases where a high LAP score does not translate into strong steering are explained either by misalignment between the computed difference-of-means direction and the unembedding projection (a poor target-token rank) or by a high baseline probability mass already assigned to the desired target. The analysis also details "chaotic regime" failures, where high perturbation sensitivity causes interventions to act unpredictably.
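The text does not spell out how λ is computed, so the following is only one plausible reading: measure how much a small random perturbation at a layer is amplified by the computation that follows it. The function name, the perturbation size `eps`, and the stand-in `downstream` callables are all assumptions.

```python
import numpy as np

def perturbation_sensitivity(downstream, hidden, eps=1e-3, n_samples=16, seed=0):
    """Average amplification of small random perturbations by the downstream map.

    downstream: callable standing in for the layers after the intervention point
    hidden:     activation vector at the intervention layer
    """
    rng = np.random.default_rng(seed)
    base = downstream(hidden)
    ratios = []
    for _ in range(n_samples):
        delta = rng.standard_normal(hidden.shape)
        delta *= eps / np.linalg.norm(delta)             # rescale to norm eps
        shifted = downstream(hidden + delta)
        ratios.append(np.linalg.norm(shifted - base) / eps)
    return float(np.mean(ratios))

# Stand-in downstream maps (assumption): one amplifies perturbations, one damps them.
hidden = np.ones(4)
lam_amplifying = perturbation_sensitivity(lambda h: 3.0 * h, hidden)
lam_damping = perturbation_sensitivity(lambda h: 0.5 * h, hidden)
print(lam_amplifying, lam_damping)
```

Under this reading, values well above 1 would flag layers where an added steering vector is likely to be distorted by later processing.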
Practical and Theoretical Implications
The LAP diagnostic operationalizes an inexpensive, training-free protocol (one forward pass plus an unembedding matrix multiplication per layer) for concept and layer screening, offering a 30× computational saving over brute-force techniques. Its predictive power supports both interpretability and intervention strategies: practitioners can assess the viability of steering for any given concept in advance, rather than resorting to post-hoc trial and error. LAP also informs decisions on method complexity, identifying when sophisticated SAE-based or nonlinear steering is warranted.
Theoretically, the results substantiate the output-alignment principle for causal interventions: effective steering requires concept alignment with the model’s output head. The strong cross-model consistency suggests that output-aligned linear subspaces for given concepts are structurally stable across architectures and scales. The persistent presence of nonlinear-only regimes indicates limitations of the linear representation hypothesis for substantial regions of model depth and motivates further interpretability work on sub-output manifold features.
Future Directions
Results suggest multiple avenues for expansion:
- Extending LAP diagnostics to multi-token completions, e.g., generative behaviors and distributional outputs, with preliminary success in refusal and entity-steering tasks.
- Systematic integration with SAE/transcoder methods for nonlinear concept extraction in regime 2 and validation of the regime framework for circuit-level mechanistic interpretability.
- Deeper exploration of final-layer anomalies and architecture-specific lens mismatches (e.g., RMSNorm vs. LayerNorm distinctions).
- Benchmarking broader concept families with the released controlled dataset to stress-test interpretability and intervention tools.
Conclusion
The presented analysis demonstrates that the logit lens, via the LAP diagnostic, robustly predicts where and when steering vectors will be effective in LLMs. This method obviates trial-and-error approaches to activation engineering, identifies necessary alignment for linear interventions, and delineates when nonlinear methods are essential. The approach is computationally efficient, transferable across architectures, scalable, and theoretically principled. The accompanying codebase and controlled benchmarks facilitate reproducibility and adoption for future interpretability research and practical activation engineering.