Compositional Steering Vector Ensembles
- Compositional SVEs are techniques that combine multiple meaningful steering vectors to modulate LLM behavior at inference time.
- They use methods like Bayesian optimization and ensemble averaging to achieve robust improvements in task performance and bias mitigation.
- Empirical results show significant gains, including up to +13.1% improvement on tasks like code generation and enhanced bias reduction.
Compositional Steering Vector Ensembles (SVEs) are a family of techniques for modifying the internal representations of LLMs at inference time by injecting specially constructed vectors—steering vectors—into model activations. Rather than relying on a single static steering direction per task or axis, SVEs combine multiple conceptually meaningful directions, enabling flexible, robust, and interpretable control over LLM behavior across diverse domains such as reasoning, safety, and bias mitigation (Han et al., 7 Feb 2026, Siddique et al., 7 Mar 2025).
1. Steering Vectors: Definition and Extraction
A steering vector is a direction in the hidden state space of a transformer-based LLM, typically with dimension $d$, that modifies model behavior when injected into intermediate activations. Formally, for an LLM $f_\theta$ and hidden state $h_\ell(x) \in \mathbb{R}^d$ at layer $\ell$ for input $x$:

$$h'_\ell(x) = h_\ell(x) + v,$$

where $v \in \mathbb{R}^d$ is the steering vector. Steering vectors can be constructed by contrasting activations between positive and negative prompts for a concept $c$, e.g., $v_c = h_\ell(p_c^{+}) - h_\ell(p_c^{-})$, and averaging across templates (Han et al., 7 Feb 2026). In bias mitigation, principal components derived from contrastive datasets (e.g., age, gender axes) are used:

$$v = \mathrm{PC}_1(D),$$

where $D$ is formed from differences in activations for positive/negative pairs (Siddique et al., 7 Mar 2025).
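The two extraction routes and the additive injection can be sketched in a few lines. This is a minimal illustration on toy activation lists, not the cited authors' code: the function names, the `scale` parameter, the centering step before the principal component, and the use of power iteration are all assumptions made for the example.

```python
import math
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def contrastive_vector(pos_acts, neg_acts):
    """Mean positive activation minus mean negative activation (averaged over templates)."""
    mu_pos, mu_neg = mean_vector(pos_acts), mean_vector(neg_acts)
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def first_principal_component(diffs, iters=100):
    """Top principal direction of centered difference vectors, via power iteration."""
    mu = mean_vector(diffs)
    X = [[x - m for x, m in zip(row, mu)] for row in diffs]
    w = [1.0] * len(mu)
    for _ in range(iters):
        proj = [sum(a * b for a, b in zip(row, w)) for row in X]  # X w
        w = [sum(p * row[j] for p, row in zip(proj, X)) for j in range(len(mu))]  # X^T (X w)
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w

def inject(hidden, v, scale=1.0):
    """Additive steering: h' = h + scale * v (scale is a hypothetical knob)."""
    return [h + scale * x for h, x in zip(hidden, v)]

# Toy 3-dimensional activations for two prompt templates (made-up numbers).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
v_c = contrastive_vector(pos, neg)
steered = inject([0.5, 0.5, 0.5], v_c)

# Toy activation differences whose variance lies mostly along the first axis.
diffs = [[2.0, 0.1], [-2.0, -0.1], [1.0, 0.0], [-1.0, 0.0]]
pc1 = first_principal_component(diffs)
```

In a real model the activation lists would come from hidden states captured at layer ℓ; the arithmetic is unchanged.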
2. Ensemble and Compositional Methods
Simple approaches inject a single steering vector learned per task or axis. SVEs generalize this by linearly composing multiple semantically meaningful basis directions $b_1, \dots, b_k$ with coefficients $\alpha \in \mathbb{R}^k$:

$$v(\alpha) = \sum_{i=1}^{k} \alpha_i b_i = B\alpha, \qquad B = [b_1, \dots, b_k].$$

This enables reuse and recombination of steerable dimensions, which can represent behavioral traits (e.g., Big Five for reasoning, fairness/sycophancy for safety) or bias axes (e.g., age, gender, race). Ensembles in bias mitigation are often formed by unit-normalized averaging of $n$ individual vectors:

$$v_{\mathrm{SVE}} = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\lVert v_i \rVert_2}.$$

Optionally, weighted sums can emphasize certain conceptual axes.
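Both composition modes reduce to short vector arithmetic. The sketch below assumes basis vectors are stored as plain lists and that "unit-normalized averaging" means normalizing each vector to unit length before averaging; the helper names and the optional `weights` argument are illustrative, not taken from the cited papers.

```python
import math

def compose(basis, alpha):
    """Linear composition: sum_i alpha_i * b_i over basis vectors b_i."""
    dim = len(basis[0])
    return [sum(a * b[j] for a, b in zip(alpha, basis)) for j in range(dim)]

def unit(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def ensemble_average(vectors, weights=None):
    """(Optionally weighted) average of unit-normalized steering vectors."""
    weights = weights or [1.0] * len(vectors)
    total = sum(weights)
    units = [unit(v) for v in vectors]
    dim = len(vectors[0])
    return [sum(w * u[j] for w, u in zip(weights, units)) / total for j in range(dim)]

composed = compose([[1.0, 0.0], [0.0, 1.0]], [0.5, -2.0])
plain_avg = ensemble_average([[3.0, 0.0], [0.0, 4.0]])
weighted_avg = ensemble_average([[3.0, 0.0], [0.0, 4.0]], weights=[3.0, 1.0])
```

Normalizing before averaging keeps a vector with a large norm from dominating the ensemble; the weighted variant reintroduces emphasis deliberately.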
3. Subspace Construction and Few-Shot Composition
To realize compositional steering, a semantic prior subspace $\mathrm{span}(B)$ is constructed, where $B = [b_1, \dots, b_k] \in \mathbb{R}^{d \times k}$ holds one concept vector per column. Steer2Adapt (Han et al., 7 Feb 2026) selects concepts relevant to the domain (e.g., reasoning or safety), engineers positive/negative calibration prompts, and computes basis vectors as average activation differences.
For adaptation to new tasks, Steer2Adapt discovers an optimal composition vector $\alpha^*$:

$$\alpha^* = \arg\max_{\alpha \in [-2,2]^k} J(\alpha),$$

using Bayesian optimization driven by a support set $S$ (typically 12 calibration examples). The objective $J$ is stability-aware, combining gains on previously wrong examples with heavy penalties for degrading correct outputs. Optimization employs a Gaussian Process surrogate with a Matérn-5/2 kernel, an expected improvement acquisition function, and up to 400 evaluations over $[-2,2]^k$.
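One way to read the stability-aware objective is as a count of newly fixed examples minus a weighted count of newly broken ones. The exact functional form is not given here, so the sketch below is an assumption: the `penalty` weight of 3.0 and the normalization by support-set size are hypothetical choices.

```python
def stability_aware_objective(base_correct, steered_correct, penalty=3.0):
    """Reward fixing wrong examples; penalize breaking correct ones more heavily.

    base_correct / steered_correct: per-example booleans on the support set,
    before and after steering. `penalty` is a hypothetical weight.
    """
    pairs = list(zip(base_correct, steered_correct))
    fixed = sum(1 for before, after in pairs if not before and after)
    broken = sum(1 for before, after in pairs if before and not after)
    return (fixed - penalty * broken) / len(pairs)

# Toy support set of four examples: two fixed, one broken.
base = [True, True, False, False]
steered = [True, False, True, True]
score = stability_aware_objective(base, steered)
```

With a penalty above 1, a composition that fixes two examples but breaks one scores below doing nothing, which is the risk-averse behavior described in the text.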
4. Inference-Time Application
SVEs are injected at specific model layers (often even-numbered residual blocks for high-level semantic control). At inference, the composed or ensemble vector is added to the hidden state(s), and the model continues forward propagation unchanged. Injection can be layer-specific (e.g., at layers of greatest inter-vector alignment in bias mitigation (Siddique et al., 7 Mar 2025)) or multi-layer for broader coverage.
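Layer-specific and multi-layer injection differ only in which residual-stream positions receive the vector. The toy forward pass below is a stand-in for a transformer (the `halve` block and the `steer` dictionary keyed by layer index are invented for illustration); in practice the same addition would typically be done with a framework-level hook on the chosen layers.

```python
def forward(x, blocks, steer=None):
    """Run residual blocks; after block i, add steer[i] to the hidden state if present."""
    steer = steer or {}
    h = list(x)
    for i, block in enumerate(blocks):
        h = [a + b for a, b in zip(h, block(h))]  # residual update
        if i in steer:
            h = [a + b for a, b in zip(h, steer[i])]  # steering injection
    return h

# Hypothetical block whose residual update halves the hidden state.
halve = lambda h: [-0.5 * a for a in h]
blocks = [halve, halve]
x = [4.0, 8.0]
v = [1.0, 0.0]

plain = forward(x, blocks)                   # no steering
single = forward(x, blocks, {0: v})          # inject after the first block only
multi = forward(x, blocks, {0: v, 1: v})     # inject after both blocks
```

Because later blocks transform whatever was injected earlier, an early injection reaches the output in altered form, while a late injection arrives unchanged.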
SVE Inference Pseudocode (Steer2Adapt Paradigm)
    Input: LLM f_θ, basis B, support set S, injection layer ℓ, BO budget T
    Initialize GP surrogate over α ∈ [-2,2]^k
    for t = 1 … T:
        α_t ← argmax ExpectedImprovement(α)
        Evaluate J(α_t) on S
        Update GP with (α_t, J(α_t))
    α* ← best α_t found
    v* ← Bα*
    for each new input x:
        compute h_ℓ(x)
        h'_ℓ(x) ← h_ℓ(x) + v*
        propagate forward
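The search loop in the pseudocode can be made concrete with a small, dependency-free sketch. It is not the Steer2Adapt implementation: the toy objective `toy_J`, the fixed kernel length scale, the random-candidate maximization of the acquisition function, and the small budget are all assumptions chosen to keep the example short.

```python
import math
import random

def matern52(a, b, length=1.0):
    """Matern-5/2 kernel with unit prior variance."""
    s = math.sqrt(5.0) * math.dist(a, b) / length
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

def cholesky(A):
    """Lower-triangular L with A = L L^T (diagonal clamped for numerical safety)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][m] * y[m] for m in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[m][i] * x[m] for m in range(i + 1, n))) / L[i][i]
    return x

def expected_improvement(mu, sigma, best):
    """Expected improvement for maximization under a N(mu, sigma^2) posterior."""
    if sigma < 1e-9:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

def bayes_opt(objective, k, budget=20, n_init=5, n_cand=100, seed=0):
    """Maximize objective over [-2, 2]^k with a GP surrogate and EI acquisition."""
    rng = random.Random(seed)
    sample = lambda: [rng.uniform(-2.0, 2.0) for _ in range(k)]
    alphas = [sample() for _ in range(n_init)]
    scores = [objective(a) for a in alphas]
    for _ in range(budget - n_init):
        K = [[matern52(a, b) + (1e-6 if i == j else 0.0)
              for j, b in enumerate(alphas)] for i, a in enumerate(alphas)]
        L = cholesky(K)
        mean_y = sum(scores) / len(scores)
        w = chol_solve(L, [s - mean_y for s in scores])  # K^{-1} (y - mean)
        best = max(scores)

        def acquisition(c):
            ks = [matern52(c, a) for a in alphas]
            mu = mean_y + sum(ki * wi for ki, wi in zip(ks, w))
            var = 1.0 - sum(ki * vi for ki, vi in zip(ks, chol_solve(L, ks)))
            return expected_improvement(mu, math.sqrt(max(var, 0.0)), best)

        # Assumption: maximize EI over random candidates instead of a gradient search.
        nxt = max([sample() for _ in range(n_cand)], key=acquisition)
        alphas.append(nxt)
        scores.append(objective(nxt))
    i_best = max(range(len(scores)), key=scores.__getitem__)
    return alphas[i_best], scores[i_best]

def toy_J(alpha):
    """Hypothetical stand-in for J: peaks at alpha = (0.5, -1.0)."""
    return -sum((a - t) ** 2 for a, t in zip(alpha, [0.5, -1.0]))

alpha_star, j_star = bayes_opt(toy_J, k=2)
```

In the actual setting each call to the objective would run the steered model on the support set, so the evaluation budget rather than the surrogate arithmetic dominates the cost.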
5. Empirical Performance and Interpretation
SVEs demonstrate improvements in both adaptation and bias mitigation, as illustrated below.
| Model | Baseline | Individual SV | SVE |
|---|---|---|---|
| Mistral 7B | 53.6% | 65.5% | 69.3% |
| Llama 3.1 8B | 75.9% | 80.1% | 81.6% |
| Qwen 2.5 7B | 84.7% | 86.1% | 86.9% |
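The gains implied by the table follow from simple subtraction. The snippet below only restates the table's numbers and computes differences in percentage points; the dictionary layout and variable names are illustrative.

```python
# (baseline, individual SV, SVE) scores copied from the table, in percent.
results = {
    "Mistral 7B": (53.6, 65.5, 69.3),
    "Llama 3.1 8B": (75.9, 80.1, 81.6),
    "Qwen 2.5 7B": (84.7, 86.1, 86.9),
}

# Gain over baseline in percentage points: (individual SV, SVE).
gains = {
    model: (round(sv - base, 1), round(sve - base, 1))
    for model, (base, sv, sve) in results.items()
}

for model, (sv_gain, sve_gain) in gains.items():
    print(f"{model}: individual SV +{sv_gain}, SVE +{sve_gain}")
```

The ensemble's margin over the best individual vector shrinks as the baseline rises, from 3.8 points on Mistral 7B to 0.8 points on Qwen 2.5 7B.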
Steer2Adapt achieves an average gain of +8.2% task accuracy across nine tasks and three LLMs, with highest improvement on code generation (+13.1%) and sycophancy reduction (+11.7%) (Han et al., 7 Feb 2026).
Key properties include:
- Data efficiency: Only a small set of calibration examples (typically 12) is needed; no weight updates.
- Stability: Risk-averse objectives, empirical absence of negative performance dips.
- Transparency: Learned coefficients are interpretable; radar plots elucidate concept relevance.
- Generalization: SVEs outperform individual steering vectors on out-of-distribution bias axes and maintain general language competence, with an average drop of 2.4% in BLiMP vs. +7.5% task gain.
- Computational cost: Only basis construction and vector averaging per ensemble; negligible inference overhead (Siddique et al., 7 Mar 2025).
6. Theoretical Rationale and Limitations
The ensemble approach is motivated by the suppression of steerability bias—dataset-specific artifacts—through averaging, reinforcing common factors across vectors and reducing overfitting to any single axis or dataset. Empirical studies support this, with SVEs consistently surpassing best individual vectors in bias mitigation tasks.
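The averaging argument can be illustrated with a small simulation. The noise model here is an assumption made for the example: each individual vector is a shared direction plus independent Gaussian "dataset artifact" noise, with arbitrary dimension, vector count, and signal strength.

```python
import math
import random

def unit(v):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

rng = random.Random(0)
d, n = 64, 8  # hypothetical hidden size and number of individual vectors

# Shared direction common to all datasets.
shared = unit([rng.gauss(0.0, 1.0) for _ in range(d)])

# Each individual vector = scaled shared direction + independent artifact noise.
vectors = [[3.0 * s + rng.gauss(0.0, 1.0) for s in shared] for _ in range(n)]

# Ensemble: average of unit-normalized vectors.
units = [unit(v) for v in vectors]
ensemble = [sum(col) / n for col in zip(*units)]

individual_cos = [cosine(v, shared) for v in vectors]
ensemble_cos = cosine(ensemble, shared)
mean_individual_cos = sum(individual_cos) / n
```

Independent noise partially cancels in the average while the shared component does not, so `ensemble_cos` comes out higher than `mean_individual_cos` under this noise model.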
Limitations include:
- Manual basis selection: Reliance on concept engineering rather than automated discovery.
- Entanglement: Concept vectors are not fully orthogonal, leading to possible non-obvious interactions.
- Scalability: As the number of basis vectors $k$ increases, Bayesian optimization becomes more challenging.
- Domain coverage: Current basis sets target reasoning and safety; extension to other modalities and nuanced domains remains open.
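The entanglement limitation can be checked directly by computing pairwise cosine similarities between concept vectors; off-diagonal values far from zero indicate overlapping directions. The concept names and vectors below are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similarity_matrix(vectors):
    """Pairwise cosine similarities; the diagonal is 1 by construction."""
    return [[cosine(a, b) for b in vectors] for a in vectors]

# Hypothetical concept vectors: the first two overlap, the third is orthogonal to both.
basis = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [1.0, 1.0, 0.0],
    "concept_c": [0.0, 0.0, 2.0],
}
sim = similarity_matrix(list(basis.values()))

for name, row in zip(basis, sim):
    print(name, [round(s, 3) for s in row])
```

A pair with high similarity means that changing one coefficient also moves the model along the other concept, which is one source of the non-obvious interactions noted in the list.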
A plausible implication is that future work will explore structured or sparse compositions, along with automated and possibly unsupervised basis construction, to improve scalability and coverage (Han et al., 7 Feb 2026).
7. Applications and AI-Safety Implications
SVEs have been applied to:
- Efficient LLM adaptation: Steer2Adapt composes task-specific behaviors from few-shot data without fine-tuning (Han et al., 7 Feb 2026).
- Bias mitigation: SVE reduces social bias while largely retaining general capability, outperforming both single-vector and prompt-based baselines (Siddique et al., 7 Mar 2025).
- Interpretable control: Analysis of learned weights and vector similarities reveals latent structure of behavioral or bias dimensions, supporting transparency in deployment.
SVE interventions are lightweight, modular, and impose minimal trade-offs between safety/alignment and utility, establishing them as practical tools for aligning foundation models with downstream requirements (Siddique et al., 7 Mar 2025, Han et al., 7 Feb 2026).