Where to Steer: Input-Dependent Layer Selection for Steering Improves LLM Alignment

Published 4 Apr 2026 in cs.LG | (2604.03867v1)

Abstract: Steering vectors have emerged as a lightweight and effective approach for aligning LLMs at inference time, enabling modulation over model behaviors by shifting LLM representations towards a target behavior. However, existing methods typically apply steering vectors at a globally fixed layer, implicitly assuming that the optimal intervention layer is invariant across inputs. We argue that this assumption is fundamentally limited, as representations relevant to a target behavior can be encoded at different layers depending on the input. Theoretically, we show that different inputs can require steering at different layers to achieve alignment with a desirable model behavior. We also provide empirical evidence that the optimal steering layer varies substantially across inputs in practice. Motivated by these observations, we introduce Where to Steer (W2S), a framework that adaptively selects the intervention layer conditioned on the input, by learning a mapping from input embeddings to optimal steering layers. Across multiple LLMs and alignment behaviors, W2S consistently outperforms fixed-layer baselines, with improvements in both in-distribution and out-of-distribution settings. Our findings highlight the importance of input-dependent control in LLM alignment and demonstrate that adaptive layer selection is a key design dimension missing in the current methodology of steering vectors.


Authors (3)

Summary

The paper introduces W2S, a framework that dynamically selects optimal intervention layers to improve LLM steering compared to fixed-layer methods.
It raises mean steerability (e.g., from 1.259 to 1.502 for CAA on Llama-2-7B-Chat) and increases the proportion of steerable examples across 13 behavior datasets.
The approach generalizes across in-distribution and out-of-distribution prompts and resolves negative-steerability cases seen with fixed-layer baselines.

Input-Dependent Layer Selection for Steering LLMs

Motivation and Theoretical Foundation

The standard paradigm for steering vectors in LLMs is to inject a behavior-shifting vector at a globally fixed layer, with recent techniques such as CAA and L2S relying on this assumption. However, this paper demonstrates both theoretically and empirically that the optimal steering layer is highly input-dependent. Theoretically, the authors construct a proof-of-concept model showing that the optimal intervention layer can vary across inputs even for a simple behavior function. Empirical analysis across 13 diverse target behaviors and two different chat-oriented LLMs (Llama-2-7B-Chat and Qwen-1.5-14B-Chat) reveals substantial variability in the most effective steering layer for different prompts, with the optimal layer often deviating by several layers from the baseline fixed-layer choice.

Figure 1: The distribution of input-optimal and fixed-layer steerability demonstrates that the best intervention layer is not uniform across prompts or tasks.

This result identifies a crucial missing axis in steering vector methodology: the need for input-dependent layer selection to improve alignment granularity without increasing model or context size.

The W2S Framework: Input-Dependent Layer Prediction

The authors introduce Where to Steer (W2S), a general framework for input-dependent layer prediction for steering vector interventions. The W2S pipeline is as follows: (1) For a given dataset, the input-specific optimal layer is identified via a sweep over layers, choosing for each prompt the layer that maximizes a steerability metric. (2) A prompt encoder (text-embedding-3-large, selected after extensive screening) maps each prompt to a semantic embedding. (3) A lightweight, shallow MLP is trained to map these embeddings to the optimal layer label. Label-space pruning restricts the output classes to layers that are optimal for at least one training prompt, so the predictor never has to learn layers that are never chosen.
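Steps (1) and (3) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the per-prompt sweep is already available as a table `sweep[i][l]` (a hypothetical name) holding the steerability of prompt `i` when steered at layer `l`, and it interprets label-space pruning as remapping the observed optimal layers onto a compact set of class indices.

```python
def optimal_layer_labels(sweep):
    """For each prompt, return the layer index with the highest steerability.

    sweep[i][l] is assumed to hold the steerability of prompt i at layer l.
    Ties are broken in favour of the earliest layer.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in sweep]


def prune_label_space(layer_labels):
    """Keep only layers that are optimal for at least one training prompt.

    Returns the kept layers (sorted) and the labels re-expressed as class
    indices into that list, which is what a classifier would be trained on.
    """
    kept_layers = sorted(set(layer_labels))
    layer_to_class = {layer: idx for idx, layer in enumerate(kept_layers)}
    class_labels = [layer_to_class[layer] for layer in layer_labels]
    return kept_layers, class_labels


# Toy sweep over 4 layers for 4 prompts (made-up numbers, for illustration).
sweep = [
    [0.1, 0.9, 0.3, 0.2],
    [0.5, 0.4, 0.45, 0.1],
    [0.0, 1.2, 0.3, 1.1],
    [0.2, 0.1, 0.0, 0.7],
]
layers = optimal_layer_labels(sweep)
kept, classes = prune_label_space(layers)
print(layers, kept, classes)
```

In this toy sweep layer 2 is never optimal, so it is dropped from the label space and the predictor would be trained on three classes instead of four. At inference time a predicted class index is mapped back to a layer through `kept`.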

At inference, the input prompt is embedded and passed through the trained predictor to select the steering layer; the steering vector is then applied at this predicted layer.
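The inference path can be illustrated with a toy model. The sketch below is not the paper's implementation and makes several loud assumptions: the "LLM" is a plain list of Python functions acting on a list of floats, the steering vector is added to the output of the selected layer with a hypothetical strength `alpha`, and a nearest-centroid rule stands in for the trained MLP predictor. All function names are made up for this example.

```python
def nearest_centroid_predictor(train_embeddings, train_layers):
    """Stand-in for the MLP layer predictor (an assumption, not the paper's model).

    Averages the embeddings of each optimal-layer class and predicts the
    layer whose centroid is closest to a new embedding.
    """
    sums, counts = {}, {}
    for emb, layer in zip(train_embeddings, train_layers):
        acc = sums.setdefault(layer, [0.0] * len(emb))
        for j, x in enumerate(emb):
            acc[j] += x
        counts[layer] = counts.get(layer, 0) + 1
    centroids = {
        layer: [x / counts[layer] for x in acc] for layer, acc in sums.items()
    }

    def predict(emb):
        def sq_dist(layer):
            return sum((a - b) ** 2 for a, b in zip(emb, centroids[layer]))
        return min(centroids, key=sq_dist)

    return predict


def forward_with_steering(layers, hidden, steering_vector, steer_layer, alpha=1.0):
    """Run the toy layer stack, adding alpha * steering_vector after one layer."""
    for idx, layer_fn in enumerate(layers):
        hidden = layer_fn(hidden)
        if idx == steer_layer:
            hidden = [h + alpha * v for h, v in zip(hidden, steering_vector)]
    return hidden


def w2s_forward(layers, hidden, steering_vector, prompt_embedding, predict):
    """Pick the steering layer from the prompt embedding, then steer there."""
    steer_layer = predict(prompt_embedding)
    return steer_layer, forward_with_steering(
        layers, hidden, steering_vector, steer_layer
    )


# Toy usage: three layers that each double the hidden state.
toy_layers = [lambda h: [2.0 * x for x in h] for _ in range(3)]
predict = nearest_centroid_predictor(
    [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]],  # made-up embeddings
    [0, 0, 2, 2],                                       # their optimal layers
)
print(w2s_forward(toy_layers, [1.0, 0.0], [0.0, 1.0], [0.1, 0.2], predict))
print(w2s_forward(toy_layers, [1.0, 0.0], [0.0, 1.0], [1.0, 0.9], predict))
```

The same steering vector produces a different final state depending on where it is injected, because later layers transform whatever was added earlier. With a real transformer the injection would typically be done by modifying the residual stream at the chosen layer rather than looping over Python functions.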

Figure 2: The W2S pipeline—training identifies and predicts optimal layers from embeddings; inference applies steering at the predicted layer for each input.

Crucially, W2S operates as a modular, computationally efficient inference-time mechanism: it requires no modification to the LLM parameters and no significant increase in computation at deployment.

Experimental Results

In-Distribution Evaluation

W2S is evaluated across 13 Model-Written Evaluation (MWE) datasets for both Llama-2-7B-Chat and Qwen-1.5-14B-Chat. Results are reported for two classes of steering vectors: the static CAA and the dynamic L2S. The key metrics are (a) mean steerability—the slope of the logit-difference propensity curve under steering, and (b) the proportion of steerable examples.
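Both metrics can be made concrete with a short sketch. It assumes that the slope is obtained by an ordinary least-squares fit of the logit difference against the steering multiplier, and that a prompt counts as "steerable" when its slope is positive; the summary does not spell out either detail, so treat both as assumptions. The function names are hypothetical.

```python
def steerability(multipliers, logit_diffs):
    """Least-squares slope of logit difference versus steering multiplier."""
    n = len(multipliers)
    mean_x = sum(multipliers) / n
    mean_y = sum(logit_diffs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(multipliers, logit_diffs))
    var = sum((x - mean_x) ** 2 for x in multipliers)
    return cov / var


def summarize(slopes):
    """Return (mean steerability, proportion of prompts with positive slope)."""
    mean_slope = sum(slopes) / len(slopes)
    steerable_fraction = sum(1 for s in slopes if s > 0) / len(slopes)
    return mean_slope, steerable_fraction


# Made-up logit differences for three prompts at multipliers -1, 0, +1.
multipliers = [-1.0, 0.0, 1.0]
curves = [
    [-2.0, 0.5, 3.0],   # responds strongly in the intended direction
    [1.0, 0.0, -1.0],   # negative steerability: steering backfires
    [0.0, 1.0, 2.0],    # responds moderately
]
slopes = [steerability(multipliers, curve) for curve in curves]
print(slopes, summarize(slopes))
```

A prompt with a negative slope moves away from the target behavior as the multiplier grows, which is the failure case the out-of-distribution analysis refers to as negative steerability.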

W2S achieves consistent improvements over fixed-layer baselines across all datasets and both LLMs. For Llama-2-7B-Chat, W2S raises mean steerability from 1.259 to 1.502 (CAA) and from 2.098 to 2.363 (L2S), and raises the proportion of steerable examples from 0.754 to 0.846 (CAA) and from 0.899 to 0.918 (L2S).
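To put the Llama-2-7B-Chat numbers in proportion, the absolute and relative gains can be computed directly from the reported values. The snippet below only does arithmetic on the figures quoted above; `relative_gain` is a helper defined here for illustration.

```python
def relative_gain(baseline, improved):
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# (fixed-layer baseline, W2S) pairs as reported for Llama-2-7B-Chat.
reported = {
    "mean steerability, CAA": (1.259, 1.502),
    "mean steerability, L2S": (2.098, 2.363),
    "proportion steerable, CAA": (0.754, 0.846),
    "proportion steerable, L2S": (0.899, 0.918),
}

for name, (fixed, w2s) in reported.items():
    print(f"{name}: +{w2s - fixed:.3f} absolute, "
          f"{relative_gain(fixed, w2s):+.1%} relative")
```

The relative gain in mean steerability is roughly 19% for CAA and 13% for L2S, so the static method benefits more from layer selection than the dynamic one in this setting.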

Figure 3: W2S improves average steerability across all target behaviors for both static and dynamic steering vector extraction methods.

Figure 4: W2S increases the mean proportion of steerable examples, indicating more prompts benefit from steering.

These improvements are robust across all tested behaviors, including complex categories such as advanced AI risk and personality/persona tasks.

Out-of-Distribution Generalization

Robustness is analyzed under prompt distribution shift: W2S predictors trained solely on a base configuration generalize to user and system prompt variants that either reinforce or attenuate the target behavior. The framework consistently outperforms fixed-layer methods in steerability and proportion of steerable inputs across all OOD prompt conditions. Importantly, W2S resolves the negative steerability failure cases exhibited by fixed-layer baselines, which suggests that input-dependent layer selection also guards against steering that backfires.

Analysis of Steering Layer Predictors

Although the classification accuracy of the layer predictor MLP remains moderate (owing to label sparsity and limited data), high accuracy is not required for downstream steerability improvements. Further experiments show that selecting even the second- or third-best layer (using frequency-aware label smoothing) preserves or enhances the measured steerability, which supports input-dependent selection as a robust principle.
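A small made-up example shows why moderate accuracy can coexist with better steerability: a prediction that misses the exact optimum can still land on a layer that beats the fixed choice. The sketch below assumes a sweep table `sweep[i][l]` of per-prompt, per-layer steerability (a hypothetical name) and uses invented numbers purely for illustration.

```python
def evaluate(sweep, predicted_layers, fixed_layer):
    """Compare a per-prompt layer choice against one fixed layer.

    Returns (accuracy against the per-prompt optimum,
             mean steerability at the predicted layers,
             mean steerability at the fixed layer).
    """
    n = len(sweep)
    optimal = [max(range(len(row)), key=row.__getitem__) for row in sweep]
    accuracy = sum(p == o for p, o in zip(predicted_layers, optimal)) / n
    mean_predicted = sum(row[p] for row, p in zip(sweep, predicted_layers)) / n
    mean_fixed = sum(row[fixed_layer] for row in sweep) / n
    return accuracy, mean_predicted, mean_fixed


# Invented steerability values for 4 prompts over 3 layers.
sweep = [
    [0.2, 1.0, 0.9],
    [0.1, 0.8, 1.1],
    [1.0, 0.3, 0.9],
    [0.4, 0.5, 0.45],
]
# The predictor misses the optimum on prompts 0 and 2 but picks a close runner-up.
predicted = [2, 2, 2, 1]
print(evaluate(sweep, predicted, fixed_layer=1))
```

Here the predictor is right only half the time, yet its mean steerability exceeds that of the fixed layer, because its misses fall on near-optimal layers while the fixed layer is poor for some prompts.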

Implications and Future Directions

These results contradict the fixed-layer steering assumption and identify input-dependent intervention as a critical design axis in alignment. The practical implications are significant: W2S provides a computationally efficient, plug-and-play module that augments both static and dynamic steering vector methods with demonstrable gains in behavioral control. It requires training only the small layer predictor, with no changes to the LLM parameters and no added context.

Theoretically, the variability of optimal intervention layers substantiates a more nuanced view of semantic concept encoding in LLMs and points toward richer mechanistic interpretability. This suggests future research directions encompassing: (1) extension to multi-layer or multi-modality steering, (2) task-specific fine-tuning of the predictor for enhanced downstream transfer, and (3) integration into tool-augmentation and automated alignment pipelines.

Conclusion

The paper introduces input-dependent layer selection via the W2S framework, providing theoretical and empirical evidence that fixed-layer steering is suboptimal. By predicting and steering at input-optimal layers, W2S consistently improves alignment in both in-distribution and distribution-shift settings, for multiple steering vector paradigms and LLM families. This research refines the methodology of LLM alignment by revealing and exploiting an overlooked dimension of intervention, offering a new foundation for fine-grained, reliable control of large models (2604.03867).
