In-Distribution Steering (IDS)
- In-Distribution Steering (IDS) is a targeted intervention technique that modulates internal model activations using in-distribution data to achieve precise behavioral shifts.
- IDS employs methods like DiffMean, Linear Probing, and dynamic steering, using metrics such as efficacy and general specificity to ensure minimal impact on unrelated skills.
- Empirical evaluations show that IDS can improve task-specific performance while maintaining overall fluency and control, though challenges like input-dependent variance remain.
In-Distribution Steering (IDS) encompasses a class of inference-time interventions that modulate the internal representations or input distributions of machine learning systems—most notably LLMs—such that desired behavioral changes are precisely effected within a regime that matches the system’s original data distribution. In IDS, interventions are not evaluated for global or out-of-distribution performance; rather, their specificity and efficacy are measured exclusively under in-distribution conditions, with stringent requirements that all other system capabilities are preserved. IDS is distinguished from generic activation steering or fine-tuning by its strict focus on preserving desired behavior—and only that behavior—on test examples drawn from the same underlying distribution as those used to learn the intervention, while minimizing extraneous side effects and ensuring high-fidelity control within that regime (Goyal et al., 5 Feb 2026).
1. Formalization and Scope of In-Distribution Steering
In the context of LLMs, IDS is formalized as an intervention $\tilde{h}_{\ell,t}(x) = h_{\ell,t}(x) + \alpha v$, where $h_{\ell,t}(x)$ denotes the activation at layer $\ell$, position $t$ for input $x$, $v$ is a learned steering vector, and $\alpha$ is a scalar strength parameter. The critical restriction is that both the vector $v$ and all efficacy/specificity metrics are defined with respect to—and only with respect to—the same distribution from which positive (D⁺) and negative (D⁻) supervision examples are drawn.
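As a concrete illustration, the following minimal PyTorch sketch applies the $h + \alpha v$ intervention via a forward hook; the `model.model.layers[l]` path in the usage comment is an assumption for Llama-style Hugging Face decoders, not something prescribed by the cited work.

```python
import torch

def make_steering_hook(v: torch.Tensor, alpha: float):
    """Forward hook adding alpha * v to the hooked layer's hidden states
    at every token position (the h + alpha * v intervention)."""
    def hook(module, inputs, output):
        # Decoder blocks often return a tuple with hidden states first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a Llama-style model at layer l:
# handle = model.model.layers[l].register_forward_hook(make_steering_hook(v, 4.0))
# ...run generation...
# handle.remove()
```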
Concretely, IDS answers two core questions:
- Efficacy: Does the steered model alter the targeted property as intended, on inputs sampled from the in-distribution regime?
- General Specificity: Does the steered model maintain all other unrelated capabilities (e.g., fluency, retention of general reasoning skills) under in-distribution evaluation, as measured by negligible changes in general performance metrics such as MMLU or GSM8K accuracy and perplexity (Goyal et al., 5 Feb 2026)?
IDS deliberately brackets out questions of adversarial robustness, out-of-distribution generalization, and distributional-shift resilience, focusing on the narrow but crucial question of behavioral precision “at home.”
2. Mechanisms and Algorithms for IDS
A range of IDS methodologies have emerged, all sharing the principle of learning or extracting an intervention solely from in-distribution data and deploying it under identical conditions. Prototypical approaches include (Goyal et al., 5 Feb 2026, Vogels et al., 15 Oct 2025, Tan et al., 2024, Gadgil et al., 4 Apr 2026):
- Difference-in-Means (DiffMean): The steering vector is the average difference of internal activations for positive versus negative examples (a minimal extraction sketch follows this list):

  $$v = \frac{1}{|D^{+}|} \sum_{x \in D^{+}} h_{\ell}(x) - \frac{1}{|D^{-}|} \sum_{x \in D^{-}} h_{\ell}(x)$$
- Linear Probe (LP): A linear classifier is trained to separate labels in the latent space, and its weights are used directly as a steering direction.
- Supervised Steering Vector (SSV): Rank-1 optimization is performed to maximize log-likelihood of desired continuations on D⁺.
- Rank-1 Representation Finetuning (ReFT-r1): Combines probing and LM loss with a sparsity constraint on the intervention.
- Partial Orthogonalization (PartialOR): Constructs multiple steering directions (e.g., refusal vs. compliance), projects to ensure preservation of critical control behaviors, and ablates (removes) the unrelated components.
- Dynamic Steering (Input-Adaptive IDS): The steering strength, or even the intervention site (layer), is made input-dependent—either by maximizing the allowed deviation under a Mahalanobis distance ball in the learned positive-class PCA subspace (Vogels et al., 15 Oct 2025), or by using a learned mapping from input embeddings to optimal layer selection (Gadgil et al., 4 Apr 2026).
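As referenced in the DiffMean bullet above, here is a minimal extraction sketch, assuming a Hugging Face-style causal LM that exposes `output_hidden_states`; pooling the last-token activation is one common convention rather than something the cited papers mandate.

```python
import torch

@torch.no_grad()
def diffmean_vector(model, tok, pos_prompts, neg_prompts, layer: int):
    """Difference-in-means steering vector: mean last-token activation
    over D+ minus the same mean over D-, taken at one hidden layer."""
    def mean_last_token(prompts):
        acts = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(model.device)
            out = model(**ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])  # (d_model,)
        return torch.stack(acts).mean(dim=0)
    return mean_last_token(pos_prompts) - mean_last_token(neg_prompts)
```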
Recent IDS methods, such as ROAST (Rollout-based On-distribution Activation Steering), extract steering directions directly from the model’s own free-running output rather than externally supervised activations, ensuring both extraction and deployment live on the same native activation manifold (Su et al., 15 Feb 2026).
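In that spirit, the following simplified sketch extracts a direction from the model's own sampled rollouts; it is a loose reading of the rollout-based idea, not ROAST's algorithm (which uses continuous soft scaling rather than a hard split), and `judge` is a hypothetical callback labeling each continuation.

```python
import torch

@torch.no_grad()
def rollout_direction(model, tok, prompts, layer: int, judge, n_samples: int = 8):
    """Sample the model's own continuations, label each rollout with judge(),
    and take a difference-in-means over the rollouts' last-token activations."""
    pos, neg = [], []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        seqs = model.generate(**ids, do_sample=True, max_new_tokens=64,
                              num_return_sequences=n_samples)
        for seq in seqs:
            text = tok.decode(seq, skip_special_tokens=True)
            h = model(seq.unsqueeze(0),
                      output_hidden_states=True).hidden_states[layer][0, -1]
            (pos if judge(text) else neg).append(h)
    return torch.stack(pos).mean(0) - torch.stack(neg).mean(0)
```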
3. Evaluation Protocols and In-Distribution Metrics
Empirical evaluation of IDS employs protocols and metrics designed to quantify both precise control and unintended side effects. The principal distinctions are:
| Metric | Description | Evaluation Domain |
|---|---|---|
| Efficacy | Target behavior shift (compliance, faithfulness) | In-distribution only |
| General Specificity | Retention of fluency and unrelated skills | In-distribution only |
| Control Specificity | Retention of related safety/alignment behaviors | In-distribution only |
General specificity is operationalized as the shift in benchmark metrics, $\Delta m = m_{\text{steered}} - m_{\text{base}}$, with values near zero indicating preservation. Control specificity is measured on control properties (e.g., refusal rates on truly harmful queries), again requiring minimal degradation under IDS (Goyal et al., 5 Feb 2026). No adversarial or OOD testing occurs within the IDS regime.
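One way to operationalize this bookkeeping is sketched below; the metric names and the fixed 0.01 tolerance are placeholders standing in for the paper's statistical-significance bounds.

```python
def specificity_report(base: dict, steered: dict, target_metric: str,
                       tol: float = 0.01) -> dict:
    """Summarize an IDS run: efficacy is the shift on the targeted metric;
    general specificity requires every other metric to stay within tol.
    base/steered map metric names (e.g. 'mmlu', 'gsm8k') to scores."""
    deltas = {k: steered[k] - base[k] for k in base}
    efficacy = deltas.pop(target_metric)
    return {
        "efficacy": efficacy,
        "general_specificity_ok": all(abs(d) <= tol for d in deltas.values()),
        "deltas": deltas,
    }
```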
Use-case studies such as overrefusal reduction and hallucination mitigation on Llama-8B-Instruct demonstrate that interventions can effect 4–10 percentage point improvements on the intended task, while holding ΔMMLU, ΔGSM8K, and ΔFluency within pre-defined statistical significance bounds, and keeping ΔControl specificity negligible (Goyal et al., 5 Feb 2026).
4. Empirical Findings and Limitations
Across all major IDS benchmarks, the following points are robustly supported (Goyal et al., 5 Feb 2026, Tan et al., 2024, Su et al., 15 Feb 2026):
- High In-Distribution Efficacy: IDS methods reliably steer the intended property within the original data regime, as measured by direct task metrics and minimal collateral damage to unrelated skills.
- Fine-Grained Preservation: Both general abilities and closely related control behaviors remain statistically unchanged under IDS, supporting the claim of surgical, property-specific interventions.
- Input-Dependent Challenges: Despite strong average performance, in-distribution steerability can be highly variable across individual inputs. Substantial fractions of test examples may be “anti-steerable”—where application of the steering vector moves behavior along the wrong axis or even inverts the intended property (Tan et al., 2024). A per-input diagnostic sketch follows this list.
- Spurious Biases: In-distribution steerability can be confounded by superficial prompt factors (e.g., choice-token assignment, position bias) that were not fully randomized away. These biases can account for much of the observed steerability variance and must be explicitly controlled (Tan et al., 2024).
- Lack of Robustness: IDS provides no guarantees under distribution shift or on adversarially perturbed inputs. Interventions achieving perfect specificity in-distribution can catastrophically erode safety (e.g., by amplifying jailbreak vulnerabilities) when exposed to prompts just outside the original regime, because their separation in activation space no longer tracks the relevant property (Goyal et al., 5 Feb 2026).
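The per-input diagnostic referenced above can be as simple as the following sketch; it assumes the signed shift of the target property under steering has already been measured for each test input, with negative values marking anti-steerable inputs.

```python
import numpy as np

def steerability_profile(deltas: np.ndarray) -> dict:
    """deltas[i] is the signed per-input shift of the target property under
    steering (e.g. change in a target-class logit). Reports distribution
    summaries plus the fraction of inputs pushed the wrong way."""
    return {
        "mean": float(deltas.mean()),
        "median": float(np.median(deltas)),
        "anti_steerable_frac": float((deltas < 0).mean()),
        "p10_p90": (float(np.percentile(deltas, 10)),
                    float(np.percentile(deltas, 90))),
    }
```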
5. Advanced Variants and Extensions
Advanced IDS techniques have emerged to address particular sources of brittleness and maximize intervention fidelity:
- Layer-Adaptivity: The “Where to Steer” framework (W2S) introduces input-dependent layer selection, learning a mapping from prompt embeddings to the empirically optimal steering site. On 13 Model-Written Evaluation tasks, W2S yields 15–25% absolute increases in steerability compared to fixed-layer baselines across multiple LLMs (Gadgil et al., 4 Apr 2026).
- Energy-Preserving Normalization: ROAST demonstrates that classic hard masking (top-k) of activation differences discards significant signal energy; continuous soft scaling (CSS) with grouped normalization instead retains that energy, yielding more robust IDS vector estimates while balancing semantic quality (Su et al., 15 Feb 2026).
- Distribution-Matching Objectives: Weakly supervised approaches (e.g., CDAS) directly match intervened output distributions to model-native counterfactual distributions using Jensen-Shannon metrics, enabling bi-directional, high-fidelity IDS and avoiding tuning of arbitrary steering factors (Bao et al., 5 Feb 2026).
- Trust-Region Steering Strength: Adaptive strength selection (as in (Vogels et al., 15 Oct 2025)) dynamically maximizes the in-distribution perturbation per input by computing the largest coefficient that keeps the activation within a learned Mahalanobis ball, eliminating global α tuning and preventing out-of-distribution collapse.
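A closed-form sketch of the trust-region computation follows; it works directly in activation space, whereas Vogels et al. fit the ball in a PCA subspace of positive-class activations, a step elided here.

```python
import numpy as np

def max_alpha_in_ball(h, v, mu, Sigma_inv, radius):
    """Largest alpha such that h + alpha * v stays inside the Mahalanobis ball
    (x - mu)^T Sigma^{-1} (x - mu) <= radius^2 fitted to positive-class
    activations; solves a scalar quadratic in alpha in closed form."""
    d = h - mu
    a = v @ Sigma_inv @ v
    b = 2.0 * (v @ Sigma_inv @ d)
    c = d @ Sigma_inv @ d - radius ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0:  # no alpha keeps the activation in the ball: do not steer
        return 0.0
    return max(0.0, (-b + np.sqrt(disc)) / (2.0 * a))
```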
6. IDS in Control, Dynamical Systems, and Beyond
IDS is not limited to LLMs; the generic problem of steering stochastic or dynamic systems from one probability law to another under constraints is central to modern control theory:
- In discrete-time stochastic linear systems, affine disturbance-feedback is used to drive the final state distribution to a target, matching via the L₁-norm between characteristic functions and enforcing chance constraints via Gil-Pelaez inversion (Sivaramakrishnan et al., 2021).
- Moment-based convexification reduces infinite-dimensional steering to tractable finite-dimensional programs by interpolating power moments and reconstructing feasible control inputs via dual moment-matching (Wu et al., 2023, Wu et al., 2024).
- Neural maximum-likelihood IDS extends to fully nonlinear discrete-time systems by parameterizing the control law with invertible neural nets and matching terminal state distributions via KL divergence over the induced flow (Rapakoulias et al., 2024).
- In meta-learning and recommendation, IDS manifests as a hidden incentive for a policy to steer its own data distribution in ways confounded with the designer's notion of “improvement,” requiring unit tests and algorithmic mitigations to detect and suppress such self-selective drift (Krueger et al., 2020).
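To make the control-theoretic setting concrete, the toy sketch below steers only the terminal mean of a discrete-time linear stochastic system with a least-norm open-loop input; the cited methods additionally shape covariances or full distributions through feedback, which this simplification omits.

```python
import numpy as np

def steer_terminal_mean(A, B, x0, target_mean, N):
    """For x_{t+1} = A x_t + B u_t + w_t with zero-mean noise, return the
    least-norm input sequence placing the terminal mean E[x_N] at target_mean,
    using E[x_N] = A^N x0 + [A^{N-1}B ... B] [u_0; ...; u_{N-1}]."""
    blocks = [np.linalg.matrix_power(A, N - 1 - k) @ B for k in range(N)]
    G = np.hstack(blocks)                                # n x (m*N)
    residual = target_mean - np.linalg.matrix_power(A, N) @ x0
    u = np.linalg.pinv(G) @ residual                     # least-norm solution
    return u.reshape(N, -1)                              # one u_t per row
```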
7. Limitations, Open Problems, and Recommendations
IDS provides a rigorous regime for isolating the effect of steering interventions on their intended property and for diagnosing unintended degradation of general or control-specific capabilities under ideal evaluation. However, its primary limitation is the lack of any guarantee under distributional shift, adversarial attack, or unforeseen prompt contexts. In safety-critical applications, reliance solely on in-distribution metrics is fundamentally misleading—a model may retain strong apparent specificity in-distribution while exhibiting catastrophic failures out-of-distribution (Goyal et al., 5 Feb 2026).
Open research challenges include:
- Developing diagnostic metrics for input-level unreliability and anti-steerability, and reporting full steerability distributions rather than aggregated statistics (Tan et al., 2024).
- Disentangling spurious prompt-dependent biases from concept-aligned steering vectors via orthogonalization or richer aggregation protocols.
- Extending IDS approaches to multi-layer, higher-rank, or multimodal interventions, and characterizing the cumulative effect of distributed interventions (Gadgil et al., 4 Apr 2026, Su et al., 15 Feb 2026).
- More principled mechanisms for automating selection of intervention strength, adaptive layer/site choice, and trade-off tuning between efficacy and generality.
- Integrating adversarial robustness and formal verification alongside IDS as a mandatory safety requirement.
In synthesis, IDS is a crucial primitive for evaluating and understanding steering interventions, but must be complemented by robust specificity assessments for deployment in real-world systems (Goyal et al., 5 Feb 2026).