Language Steerability in LLMs

Updated 4 July 2026

Language-steerability is the capacity of models to modify outputs through controlled interventions, such as prompt changes or hidden state adjustments, to target specific linguistic or stylistic features.
Methods include linear activation steering, sparse feature modifications, and language-vector techniques, all of which provide measurable improvements in multilingual and style-specific generation.
Empirical findings reveal that steerability emerges during intermediate pretraining stages and depends on careful intervention scaling and robust evaluation protocols to minimize side effects.

Language-steerability is the capacity of an LLM, or of a language-conditioned predictive system, to alter its outputs toward a specified target by changing prompts, profiles, hidden representations, sparse features, or token distributions at inference time. In the current literature, the term covers at least three related phenomena: controllable generation of a target language in multilingual models, controllable expression of semantic or stylistic concepts such as emotion or figurative language, and controllable adaptation to user, community, or persona-specific preferences. A central finding is that steerability is not reducible to mere concept encoding: a model may “know” a concept long before it becomes reliably steerable through simple interventions (She et al., 3 Aug 2025). In multilingual settings, this has motivated a family of “language vector” methods that treat languages as directions in an internal semantic space and modify activations without parameter updates (Kirtane et al., 2 Feb 2026).

1. Formal definitions and representational viewpoint

A standard formalization of linear steerability treats a hidden state $h_\ell \in \mathbb{R}^m$ at layer $\ell$ as the object of intervention and applies a concept direction $v_\ell$ with steering strength $\alpha$:

$h'_\ell = h_\ell + \alpha \cdot v_\ell.$

Within the “Intervention Detector” framework, positive and negative stimuli for a concept are used to collect last-token hidden representations, form normalized difference matrices, extract the top principal component by PCA, and score alignment by

$I_{\ell,i} = \langle R(M,s_i)[-1], v_\ell \rangle.$

The resulting checkpoint-by-layer matrix is used to analyze where and when linear steerability emerges during pretraining (She et al., 3 Aug 2025).
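The pipeline above (difference matrix, top principal component, inner-product score, additive steering) can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation: the function names, the toy hidden states, and the choice to skip mean-centering before the SVD (so that the component follows the shared difference direction) are all assumptions.

```python
import numpy as np

def concept_direction(pos_states, neg_states):
    """Top principal direction of normalized positive-minus-negative differences."""
    diffs = pos_states - neg_states                       # shape (n_pairs, m)
    diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    # Assumption: uncentered SVD; rows of vt are unit-norm directions.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    v = vt[0]
    # The sign of a principal component is arbitrary; orient it toward "positive".
    if np.dot(diffs.mean(axis=0), v) < 0:
        v = -v
    return v

def alignment_score(h_last, v):
    """Inner product between a last-token hidden state and the concept direction."""
    return float(np.dot(h_last, v))

def steer(h, v, alpha):
    """Additive intervention h' = h + alpha * v."""
    return h + alpha * v

# Synthetic stand-in for last-token hidden states (hypothetical data).
rng = np.random.default_rng(0)
m, n_pairs = 16, 50
true_dir = np.zeros(m)
true_dir[0] = 1.0
base = rng.normal(size=(n_pairs, m))
pos = base + 2.0 * true_dir + 0.5 * rng.normal(size=(n_pairs, m))
neg = base - 2.0 * true_dir + 0.5 * rng.normal(size=(n_pairs, m))

v = concept_direction(pos, neg)
h = rng.normal(size=m)
gain = alignment_score(steer(h, v, 3.0), v) - alignment_score(h, v)
```

Because `v` has unit norm, steering with strength `alpha` raises the alignment score by exactly `alpha` in this linear toy; in a real model the downstream effect of the shift is what has to be measured.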

In multilingual steering, a language direction is often defined as an activation-difference vector between semantically matched source- and target-language prompts. One formulation computes a layer-$t$ steering vector

$v^{(t)} = \mathbb{E}_{x \sim D_{\mathrm{compute}}}[h^{(t)}(x^t)] - \mathbb{E}_{x \sim D_{\mathrm{compute}}}[h^{(t)}(x^s)],$

then injects it during inference by replacing token-position activations with $h_p^{(t)} \leftarrow h_p^{(t)} + \alpha \cdot v^{(t)}$. A related multilingual formulation, ReCoVeR, isolates language-specific vectors $r_\ell^{(i)} = v_\ell^{(i)} - c^{(i)}$ from a multi-parallel corpus and either adds the normalized target vector or adds the target vector while subtracting the normalized source vector in cross-lingual settings (Kirtane et al., 2 Feb 2026).
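A difference-of-means language vector and its injection can be sketched as follows. The arrays stand in for layer activations of matched target- and source-language prompts; the names, shapes, and synthetic offset are illustrative assumptions rather than details from the cited work.

```python
import numpy as np

def language_vector(h_target, h_source):
    """Mean target-language activation minus mean source-language activation."""
    return h_target.mean(axis=0) - h_source.mean(axis=0)

def inject(h_positions, v, alpha):
    """Add alpha * v to the activation at every token position (broadcast over rows)."""
    return h_positions + alpha * v

# Hypothetical activations: 8 matched prompts, hidden size 4.
rng = np.random.default_rng(1)
h_source = rng.normal(size=(8, 4))
offset = np.array([0.5, -1.0, 0.0, 2.0])       # synthetic "language shift"
h_target = h_source + offset

v = language_vector(h_target, h_source)
h_tokens = rng.normal(size=(5, 4))              # 5 token positions at the chosen layer
h_steered = inject(h_tokens, v, alpha=1.5)
```

In this toy setup the recovered vector equals the synthetic offset exactly, because the paired prompts differ only by that offset; real activations would add noise that the averaging is meant to suppress.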

A more localized version of the same idea appears in sparse feature steering. There, a pretrained sparse autoencoder (SAE) maps a residual-stream activation to a sparse code, and the intervention modifies the activation of a single feature index in that code.

This replaces diffuse residual steering with a monosemantic or near-monosemantic feature intervention (Chou et al., 17 Jul 2025).
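One way such a single-feature intervention could look is sketched below with a randomly initialized toy autoencoder. Everything here is an assumption made for illustration: the ReLU encoder, the weight shapes, the choice to clamp the feature to a fixed value, and the choice to add back the reconstruction residual so that only the edited feature changes the activation.

```python
import numpy as np

# Toy SAE with random weights (hypothetical; a real SAE would be pretrained).
rng = np.random.default_rng(0)
m, k = 8, 32                                    # hidden size, dictionary size
W_enc = rng.normal(size=(m, k)) / np.sqrt(m)
b_enc = np.zeros(k)
W_dec = rng.normal(size=(k, m)) / np.sqrt(k)
b_dec = np.zeros(m)

def encode(h):
    """ReLU encoder producing a sparse code."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def decode(z):
    """Linear decoder back to the residual stream."""
    return z @ W_dec + b_dec

def steer_feature(h, j, value):
    """Clamp feature j to `value`, keeping the SAE reconstruction error unchanged."""
    z = encode(h)
    residual = h - decode(z)                    # part of h the SAE does not explain
    z_steered = z.copy()
    z_steered[j] = value
    return decode(z_steered) + residual

h = rng.normal(size=m)
j, value = 5, 4.0
h_steered = steer_feature(h, j, value)
expected_shift = (value - encode(h)[j]) * W_dec[j]
```

Under these assumptions the net effect on the activation is a shift along one decoder row, which is what makes the intervention localized compared with a dense residual-stream vector.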

Steerability is also formalized outside hidden-state editing. In natural-language recommenders, a steering intervention is a function applied to a natural-language user profile, and success is measured by a tag-specific ranking shift. In multilingual system prompting, cross-lingual prompt steerability is represented by a four-dimensional metric vector, with an aggregated score built from min–max normalized components (Zhou et al., 28 Jan 2026).

2. Emergence during training and internal geometry

A key empirical result is that linear steerability emerges during intermediate stages of pretraining rather than appearing uniformly from the start. Across periodically saved CrystalCoder (7B) checkpoints, “anger” steerability remains near zero for an initial portion of training and then rises rapidly in higher layers. “Fear” emerges slightly earlier and “happiness” within its own window of training, while “sadness,” “surprise,” and “disgust” become steerable only near the very end. The same study reports that the first PCA component of the concept-difference matrix explains only a small share of variance early in training but a larger share at later checkpoints, and that cosine similarity between adjacent checkpoint concept vectors drops sharply at the moment steerability appears. The authors interpret this as increasing linear separability and signal-to-noise ratio in the hidden space (She et al., 3 Aug 2025).

The same work treats linear steerability as a distinct emergent capability, separate from concept encoding or raw generation ability. Heatmaps of checkpoint-by-layer ID scores show that early training is characterized by weak scores across layers, whereas after emergence the top layers form a bright band of strong alignment. Entropy over normalized layer scores is high early, drops as a few layers concentrate the concept, and then rebounds slightly when many layers become aligned. A plausible implication is that pretraining induces a reorganization from diffuse representation to layer-localized control, after which simple additive interventions become effective (She et al., 3 Aug 2025).
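The entropy diagnostic can be illustrated with a short sketch. The normalization used here (absolute scores divided by their sum) and the natural-log Shannon entropy are assumptions; the source does not specify either.

```python
import numpy as np

def layer_entropy(scores):
    """Shannon entropy (nats) of per-layer scores normalized to sum to one."""
    w = np.abs(np.asarray(scores, dtype=float))   # assumption: use score magnitudes
    p = w / w.sum()
    p = p[p > 0]                                  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

diffuse = layer_entropy([1.0, 1.0, 1.0, 1.0])     # concept spread evenly over layers
localized = layer_entropy([0.0, 0.0, 0.0, 1.0])   # concept concentrated in one layer
```

A uniform profile over four layers gives the maximum value `log(4)`, while a profile concentrated in one layer gives zero, matching the high-then-low pattern described above.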

Later multilingual work reports a related geometric picture. CLaS-Bench finds that language-specific structure emerges predominantly in later layers and that steering directions cluster by language family. “Cross-Lingual Steering for Figurative Language Generation” similarly reports a reusable but target-dependent cross-lingual signal: directions learned from figurative–literal activation differences transfer across six languages, and removing the shared component weakens native steering (Gurgurov et al., 13 Jan 2026).

3. Intervention families and their empirical performance

Several intervention families now coexist, differing mainly in the representation they edit and in how the steering direction is extracted.

Family	Representation edited	Core update
Linear activation steering	Residual or hidden state	$h'_\ell = h_\ell + \alpha \cdot v_\ell$
Sparse feature steering	SAE code	Modify the activation of a single SAE feature
Language-vector steering	Layerwise mean-pooled activations	$h_p^{(t)} \leftarrow h_p^{(t)} + \alpha \cdot v^{(t)}$
ReCoVeR	Hidden states with centered language vectors	Add target vector, subtract source vector in Cross-LC
DLM-SWAI	Token logits in diffusion denoising	Bias token distributions at every denoising step with precomputed token-level style scores
Neural FOXP2	Sparse language-neuron support	Signed sparse shift in SAE feature space

Sparse feature steering shows that a single SAE feature can be sufficient for deterministic language control. On Gemma-2-9B, steering one feature is evaluated by FastText target-language accuracy for Chinese, Japanese, Spanish, and French, while semantic fidelity, measured by LaBSE similarity, is preserved. The strongest interventions occur in mid-to-late layers of Gemma-2-9B, and specific attention heads are disproportionately aligned with language-sensitive features; for example, a single head dominates for both Chinese and French (Chou et al., 17 Jul 2025).

Training-free language vectors have been applied to multilingual in-context learning. On Llama-3.1-8B-Instruct, language steering improves scores on MGSM, XNLI, and MSVAMP over the unsteered baseline. On Qwen-2.5-14B, MGSM also rises. Hierarchical clustering of steering vectors yields Romance, Slavic, Indo-Aryan, and East Asian groupings, and five of six cross-task transfers among MGSM, MSVAMP, and XNLI improve over baseline, with one failure case in the MSVAMP $\to$ XNLI direction (Kirtane et al., 2 Feb 2026).

ReCoVeR addresses language confusion rather than few-shot transfer. Its fixed version adds normalized target-language vectors or target-minus-source vectors; its supervised version, ReCoVeR+, learns a small low-rank residual block while freezing the LLM. On cross-lingual language control in LCB, ReCoVeR+ raises LPR on Llama 3.1, Qwen 2.5, and Gemma 2. On MMLU, steering with ReCoVeR keeps the accuracy drop within a small bound, whereas LSI incurs larger drops (Sterz et al., 18 Sep 2025).

Benchmark-scale comparisons are less favorable to many sophisticated steering directions than to simple residual means. In CLaS-Bench, the average harmonic-mean steering score on Llama-3.1-8B-Instruct is reported for DiffMean alongside LAPE, probe-derived directions, PCA steering, LDA steering, SAE-DiffMean, and two prompting baselines. The comparison suggests that unsupervised difference-of-means directions can be more robust than probe-derived or low-dimensional reconstruction-based directions for multilingual language forcing (Gurgurov et al., 13 Jan 2026).

The scope of steerability methods has also broadened beyond autoregressive transformers. DLM-SWAI biases token distributions in diffusion LLMs at every denoising step using precomputed token-level style scores, with no auxiliary model and no hidden-state hooks. It is evaluated on OSE readability control, by accuracy and macro-averaged score on LLaDA-8B and Dream-7B, and on RealTox, by non-toxic accuracy. Neural FOXP2, by contrast, identifies a sparse, low-rank “language-neuron” circuit and applies signed sparse activation shifts in low-to-mid layers; on LLaMA-3 8B it reports several language-control metrics, including Spanish leakage (An et al., 28 May 2026).

4. Evaluation protocols and benchmark design

The diversity of steering methods has been matched by a rapid diversification of evaluation protocols. CLaS-Bench defines language forcing success as the fraction of outputs in the target language according to FastText LID and semantic relevance as a normalized multilingual judge score, and it combines the two with the harmonic mean. Its construction provides a standardized multilingual benchmark of steering instances for prompt-based and representation-based interventions alike (Gurgurov et al., 13 Jan 2026).
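A harmonic-mean combination of language forcing success and semantic relevance can be written in a few lines. The function and argument names are hypothetical, and both inputs are assumed to lie in [0, 1].

```python
def steering_score(forcing_success, relevance):
    """Harmonic mean of two scores in [0, 1]; defined as 0 when both are 0."""
    total = forcing_success + relevance
    if total == 0:
        return 0.0
    return 2.0 * forcing_success * relevance / total

balanced = steering_score(0.6, 0.6)     # both objectives met moderately
lopsided = steering_score(0.9, 0.1)     # right language, but content lost
```

The harmonic mean penalizes imbalance: the lopsided case scores well below the balanced one even though both pairs of inputs have a similar arithmetic mean, so a method cannot score well by forcing the language while destroying relevance.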

SteerEval formalizes steerability for natural-language recommenders. Given a profile revision for a tag, it computes a tag-specific ranking AUC and the change in that AUC between the original and the revised profile. The framework distinguishes increase and decrease interventions, measures changes in the position of the ground-truth next item within its relevant or irrelevant subset, and evaluates both broad tags such as movie genres and finer-grained tags such as trigger warnings. Genres are substantially easier to steer than triggers. Oracle metadata sharply improves both, indicating that world-knowledge limitations are a major bottleneck (Zhou et al., 28 Jan 2026).
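A tag-specific ranking AUC and its change under a profile revision might be computed as in the sketch below. This is one plausible reading, not SteerEval's exact definition: the AUC is taken as the probability that a tagged item outscores an untagged one (ties counted as half), and the item scores and tags are made-up values.

```python
import numpy as np

def tag_auc(scores, has_tag):
    """Probability that a tagged item is scored above an untagged item."""
    scores = np.asarray(scores, dtype=float)
    has_tag = np.asarray(has_tag, dtype=bool)
    pos, neg = scores[has_tag], scores[~has_tag]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical recommender scores for four items, before and after an
# "increase" revision of the profile for one tag.
has_tag = [False, True, False, True]
scores_before = [0.9, 0.2, 0.8, 0.1]
scores_after = [0.3, 0.9, 0.2, 0.8]

delta_auc = tag_auc(scores_after, has_tag) - tag_auc(scores_before, has_tag)
```

A positive change indicates that the revision moved tagged items up the ranking; for a "decrease" intervention the desired sign would be negative.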

A different evaluation philosophy appears in “A Course Correction in Steerability Evaluation,” which models user goals and model outputs as vectors in a multi-dimensional goal space. It defines overall steering error as the expected distance between the achieved goal vector and the target, and decomposes failures into miscalibration, which measures overshoot or undershoot along the desired change direction, and orthogonality, which measures unintended drift in non-target dimensions. On a four-dimensional text-rewriting task with reading difficulty, formality, lexical diversity, and length, side effects remain persistent even when prompt engineering, best-of-$N$ sampling, or reinforcement learning is applied (Chang et al., 27 May 2025).
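The decomposition can be illustrated geometrically: project the steering error onto the requested change direction and measure what is left over. The sketch below is an assumed form of that idea using Euclidean norms; the paper's exact normalization may differ, and the goal vectors are invented.

```python
import numpy as np

def decompose_error(z_start, z_target, z_out):
    """Split steering error into a signed along-direction part and an orthogonal part."""
    desired = z_target - z_start               # change the user asked for
    error = z_out - z_target                   # where the output actually landed
    unit = desired / np.linalg.norm(desired)
    along = float(np.dot(error, unit))         # > 0 overshoot, < 0 undershoot
    ortho = error - along * unit               # drift in non-target directions
    return float(np.linalg.norm(error)), along, float(np.linalg.norm(ortho))

# Goal space: (reading difficulty, formality, lexical diversity, length).
z_start = np.array([0.0, 0.0, 0.0, 0.0])
z_target = np.array([1.0, 0.0, 0.0, 0.0])      # request: raise reading difficulty only
z_out = np.array([0.5, 0.3, 0.0, 0.4])         # undershoots and drifts in two other axes

total, miscalibration, orthogonality = decompose_error(z_start, z_target, z_out)
```

Here the output undershoots the request by 0.5 along the target axis and drifts by 0.5 off-axis, so both failure types contribute equally to the total error.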

Prompt-only steering has also acquired its own multilingual evaluation framework. “Cross-Lingual Prompt Steerability” defines four component metrics and combines them into a single weighted aggregate score. Across Qwen2.5-7B-Instruct, LLaMA-3.1-8B-Instruct, and Gemma-3-12B-IT, optimized prompts improve mean accuracy for each model, while also increasing cross-lingual consistency and reducing unnecessary language-switching; for Spanish, the share of reasoning units in the native language rises after optimization (Zhang et al., 2 Dec 2025).
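An aggregate of this kind, built from min–max normalized components, can be sketched as follows. The metric values and the equal weights are placeholders chosen for illustration, and every component is assumed to be oriented so that higher is better.

```python
import numpy as np

def min_max(column):
    """Rescale a column to [0, 1]; a constant column maps to zeros."""
    lo, hi = column.min(), column.max()
    if hi == lo:
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

def aggregate(metrics, weights):
    """Weighted sum of min-max normalized metric columns (rows = prompts or models)."""
    normed = np.column_stack([min_max(metrics[:, j]) for j in range(metrics.shape[1])])
    return normed @ weights

# Three hypothetical prompts scored on four hypothetical metrics.
metrics = np.array([
    [0.5, 10.0, 1.0, 3.0],
    [0.7, 20.0, 2.0, 5.0],
    [0.9, 30.0, 3.0, 7.0],
])
weights = np.array([0.25, 0.25, 0.25, 0.25])   # placeholder weights summing to 1
scores = aggregate(metrics, weights)
```

Min–max normalization makes the aggregate relative to the set being compared: adding or removing a prompt can change every normalized value, so scores are only comparable within one evaluation run.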

Earlier work used psychometric or choice-based proxies. The OCEAN-based framework sums integer trait ratings to obtain a trait-specific steerability score and visualizes overlap between prompted personalities; it found pronounced peaks for Conscientiousness and Neuroticism and overlap between Extraversion and Agreeableness. STEER-BENCH instead evaluates community-specific steering as multiple-choice accuracy after conditioning on community-aligned examples, using contrasting subreddit pairs and validated questions (Noever et al., 2023).

5. Theoretical accounts, diagnostics, and failure modes

The strongest theoretical treatment of steering magnitude appears in “Towards Understanding Steering Strength.” It studies the dependence of token probabilities, concept presence, and cross-entropy on the scalar steering strength $\alpha$. The paper derives a “Bump Law,” under which most token-probability shifts increase and then decrease once $\alpha$ exceeds a token-specific threshold; a “Sigmoidal Law,” under which concept-level probability shifts follow an S-shaped curve; a “Quadratic Law,” under which cross-entropy grows like $\alpha^2$ near $\alpha = 0$ with no linear term; and a “Saturation Law,” under which the distribution collapses onto top log-odds tokens as $\alpha$ grows large. The analysis is validated on eleven decoder-only transformers and implies that steering has a non-monotonic sweet spot rather than a monotone gain regime (Taimeskhanov et al., 2 Feb 2026).
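Three of these behaviors can be reproduced in a toy softmax, which may help build intuition. The setup is an assumption, not the paper's model: steering is modeled as adding a fixed per-token logit shift scaled by the strength, and cross-entropy is measured against the unsteered distribution.

```python
import numpy as np

def steered_probs(logits, shift, alpha):
    """Softmax of logits + alpha * shift, computed stably."""
    z = logits + alpha * shift
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def cross_entropy(p_ref, q):
    """Cross-entropy of q relative to the reference distribution p_ref."""
    return float(-(p_ref * np.log(q)).sum())

logits = np.array([2.0, 0.0, 0.0])   # token 0 dominates before steering
shift = np.array([0.0, 1.0, 2.0])    # steering favors token 2 most, token 1 somewhat

# Bump: a moderately favored token first gains, then loses probability.
p_mid = [steered_probs(logits, shift, a)[1] for a in (0.0, 1.0, 5.0)]

# Saturation: the most favored token absorbs nearly all mass at large strength.
p_top_large = steered_probs(logits, shift, 10.0)[2]

# Quadratic growth: doubling a small strength roughly quadruples the excess cross-entropy.
p0 = steered_probs(logits, shift, 0.0)
base = cross_entropy(p0, p0)
excess_small = cross_entropy(p0, steered_probs(logits, shift, 0.01)) - base
excess_double = cross_entropy(p0, steered_probs(logits, shift, 0.02)) - base
ratio = excess_double / excess_small
```

In this toy the rise and fall of the middle token comes from competition with a more strongly favored token, which is consistent with the idea that moderate strengths help a token that larger strengths eventually crowd out.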

A different explanation of steering instability is offered by the Cylindrical Representation Hypothesis. CRH retains linear concept directions but drops the assumption that concept directions can be made orthogonal without loss. It posits a sample-specific cylindrical geometry: a central axis captures the main concept difference, a normal plane controls steering sensitivity, and only certain angular sectors in that plane strongly facilitate concept activation. The paper argues that the magnitude of the normal-plane component is predictable, whereas the sensitive sector is not. Its empirical verification reports an effectively zero Pearson correlation between difference-vector cosine similarity and sample-wise steering-strength difference, and interprets this as intrinsic uncertainty at the sector level (Gao et al., 3 May 2026).

Work on geometric diagnostics complements these theories. “The Geometric Canary” distinguishes supervised and unsupervised geometric stability. Supervised Shesha variants predict linear steerability, as measured by Spearman correlation, on synthetic models, SST-2, and MNLI, while retaining substantial partial correlations after controlling for separability measures. By contrast, unsupervised stability fails to predict steering on real tasks such as SST-2, yet excels at drift detection, registering greater geometric change than CKA on average, with the largest reported gap in the Llama family (Raju, 20 Apr 2026).

Large-scale empirical audits show that many steering methods remain brittle. “Steering off Course” evaluates DoLa, function vectors, and task vectors across a large set of models drawn from multiple families. It finds only modest or negative gains for DoLa on TruthfulQA and FACTOR, and large variability for activation patching: under default settings, function vectors and task vectors each reach the reported recovery threshold relative to the five-shot baseline in only a subset of model–task combinations, and even with extensive search, recoveries remain inconsistent across families and tasks. The paper attributes these failures to flawed assumptions about where knowledge is localized and how it is promoted across layers (Silva et al., 6 Apr 2025).

Prompt-based steering can fail even more directly in high-stakes settings. In the college-admissions essay study, LLM-generated essays are readily distinguishable from human essays by F1 score, using either T5 embeddings or TF-IDF features for the LLM-versus-human comparison. Demographic prompting is “remarkably ineffective”: the prompted and unprompted synthetic essays are more similar to each other than to human text, and prompting causes lexical insertions such as “Asian,” “parent,” and “California” without changing deeper stylistic traits. This exposes a persistent gap between surface instruction following and authentic steerability (Lee et al., 25 Mar 2025).

6. Applications, cross-lingual transfer, and open problems

Language-steerability has become a practical tool for multilingual control. In figurative language generation, a direction estimated from figurative–literal activation differences in one language can be applied in another. Across five figurative categories, six languages, and four multilingual LLMs, positive target-category gains are reported on non-monolingual routes, with metaphor and simile transferring most robustly. German is reported as the most receptive target language, Bengali as the weakest, and leave-target-out mean vectors are reported to “win” or “tie” native steering in a share of settings. The authors present this as direct evidence of a reusable, language-agnostic but target-dependent cross-lingual signal (Liu et al., 28 May 2026).

The same control logic extends beyond language identity. Data-driven personas derived by collaborative filtering improve macro prediction accuracy over the best prompting baselines on OpinionQA, by a margin that depends on the model, when converted into soft prompts by a learned prefix model. STEER-BENCH shows that community-sensitive steerability is measurable at scale but still far from human performance: human experts, scored against silver labels, outperform the best models, which in turn outperform the weakest model. In recommendation, SteerEval finds that LLM rewriting of user profiles yields the strongest steering among tested interventions, while the relative position of the true next item changes only slightly on average, indicating limited loss of baseline preference information (Li et al., 2023).

The literature therefore converges on a mixed assessment. Steerability is real, often strong, and increasingly interpretable. It can emerge during pretraining, localize in later layers, transfer across tasks and languages, and sometimes be driven by a single sparse feature or a compact low-rank subspace. At the same time, robustness is conditional: steering strength is non-monotonic, geometry can be sample-specific, unsupervised diagnostics may fail to predict controllability, prompt-based identity steering can remain superficial, and widely used intervention recipes can degrade or fail across model families. The present state of the field suggests that reliable language-steerability depends on three ingredients in combination: a representation in which the target is linearly accessible, an intervention scale that remains inside the model’s quality-preserving regime, and an evaluation protocol that measures not only target attainment but also semantic preservation, side effects, and cross-context generalization (She et al., 3 Aug 2025).