Systematic Lexical-Ablation Experiments

Updated 13 December 2025
  • The paper demonstrates that early GPT-2 layers (0–3) act as dedicated lexical sentiment detectors, with activation patching shifting the sentiment output by up to 18 percentage points.
  • It employs a linear logistic probe on final-layer activations to quantify sensitivity, specificity, and context variance, confirming sharply localized lexical signals.
  • The study falsifies mid-layer contextual integration theories by showing that late transformer layers (8–11) alone integrate broader contextual cues.

Systematic lexical-ablation experiments refer to a class of mechanistic interpretability techniques that probe the causal contribution of individual lexical items, such as sentiment-bearing words, within large language models. Specifically, these experiments apply controlled interventions to a model's hidden-state activations at prescribed layers and token positions, with the goal of isolating how and where lexical information is encoded, processed, and integrated with wider context. In GPT-2, systematic lexical ablation (operationalized via activation patching) reveals that the early transformer layers (0–3) serve as dedicated lexical sentiment detectors, encoding localized, position-specific, and context-independent polarity signals, while contextual integration mechanisms appear only in the late layers (8–11) (Hatua, 7 Dec 2025).

1. Mathematical Formalism and Implementation

Systematic lexical ablation in GPT-2 exploits activation patching, a targeted intervention method defined over the hidden-state matrices at each transformer layer. For any layer $l \in L = \{0, 1, \ldots, 11\}$, the activations for a source sentence $s$ and a corrupted sentence $c$ are denoted $A_l^s, A_l^c \in \mathbb{R}^{T \times D}$, where $T$ is the number of tokens and $D$ the hidden dimension. A binary mask $M \in \{0,1\}^{T \times 1}$ selects the ablation/patching positions $P \subseteq \{1, \ldots, T\}$. The patched activation is

$$\tilde{A}_l = M \odot A_l^s + (1 - M) \odot A_l^c$$

where $\odot$ denotes element-wise multiplication broadcast over $D$. Zeroing out a lexical signal in $A_l^s$ is achieved by

$$\hat{A}_l = (1 - M) \odot A_l^s$$

In practice, the model runs a forward pass up to layer $l-1$ on the corrupted sentence. At layer $l$, $A_l^c$ is replaced by $\tilde{A}_l$ or $\hat{A}_l$, and the forward pass continues to the final layer. This isolates the causal effect of a specific token's representation at a given layer on the model's output.
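As a concrete illustration, the following is a minimal sketch of this patching procedure using the Hugging Face transformers GPT-2 implementation with a PyTorch forward hook. The example sentence pair, layer index, and patch position are hypothetical, and the sketch assumes the source and corrupted sentences tokenize to the same length $T$; it is not the paper's released code.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def hidden_states(text):
    """Cache the per-layer activations A_l for a sentence: a tuple of the
    embedding output plus one (1, T, D) tensor per transformer block."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states

def patch_layer(corrupted_text, source_states, layer, positions):
    """Forward the corrupted sentence, but at `layer` overwrite the
    activations at `positions` with the cached source values (A-tilde_l)."""
    def hook(module, inputs, output):
        hs = output[0].clone()  # (1, T, D); assumes s and c share length T
        hs[:, positions, :] = source_states[layer + 1][:, positions, :]
        return (hs,) + output[1:]

    handle = model.h[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():
            out = model(**tok(corrupted_text, return_tensors="pt"),
                        output_hidden_states=True)
    finally:
        handle.remove()
    return out.hidden_states[-1]  # patched final-layer activations A_11

# Hypothetical single-word substitution pair; position 3 is the sentiment word.
src = hidden_states("The movie was wonderful and moving.")
patched_final = patch_layer("The movie was terrible and moving.", src,
                            layer=1, positions=[3])
```

Zero ablation ($\hat{A}_l$) follows the same pattern, with the hook writing zeros at the selected positions instead of the cached source values.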

2. Experimental Setup and Key Variables

In the canonical setup, a linear logistic regression probe $f$ is trained on final-layer (layer 11) activations, with the base model frozen, to classify sentiment (positive vs. negative); it reaches 95% validation accuracy on 5,000 held-out sentences. Lexical testing is conducted on 1,000 (source, corrupted) sentence pairs constructed by single-word substitutions (positive to negative and vice versa), across six distinct sentence contexts. For each pair $(s, c)$ and each layer $l$, the shift in sentiment prediction after patching is

$$\Delta_l = \hat{y}(\text{patch-to-pos}) - \hat{y}(\text{uncorrupted})$$

where $\hat{y}$ is the probe output. The absolute value $S_l = |\Delta_l|$ quantifies the "sensitivity", i.e., the causal-effect magnitude for layer $l$. To quantify position specificity, patching is applied both at the true sentiment position $P$ and at random off-target positions $q \neq P$ (averaged over 2,000 draws of $q$), yielding a specificity score $\sigma_l = \Delta_l(P) - \Delta_l(q)$. Context independence is measured as the variance of $\Delta_l$ across the $k \approx 6$ contexts. Together, these metrics provide a multi-faceted profile of how lexical sentiment features are processed at each model layer.
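A minimal sketch of how the probe and these summary statistics could be computed. The random placeholder data, array layout, and the reading of context variance (variance of the mean effect across frames) are assumptions rather than details from the paper; in the actual setup, the probe features would be the model's layer-11 activations for the training sentences.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# --- Probe: a linear classifier on frozen final-layer activations. ---
# Placeholder data stands in for pooled layer-11 activations and labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 768))    # stand-in for layer-11 features
y_train = rng.integers(0, 2, size=1000)   # stand-in sentiment labels
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# --- Layerwise metrics from raw patching effects. ---
def layer_profile(deltas_on, deltas_off, deltas_by_context):
    """deltas_on:         Delta_l at the true sentiment position P, one per pair
    deltas_off:        Delta_l at random off-target positions q != P
    deltas_by_context: per-context lists of Delta_l (k ~ 6 sentence frames)"""
    d_on, d_off = np.asarray(deltas_on), np.asarray(deltas_off)
    sensitivity = np.abs(d_on).mean()          # S_l = |Delta_l|
    specificity = d_on.mean() - d_off.mean()   # sigma_l = Delta_l(P) - Delta_l(q)
    # One reading of context independence: variance of the mean effect
    # across the sentence frames.
    ctx_var = float(np.var([np.mean(d) for d in deltas_by_context]))
    return sensitivity, specificity, ctx_var
```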

3. Layerwise Results: Localization and Context Invariance

Experimental results demonstrate "Early Layer Dominance" in lexical sentiment encoding. Mean layerwise sensitivity for $l = 0$–$3$ is $S_0 = 0.182$, $S_1 = 0.164$, $S_2 = 0.143$, $S_3 = 0.118$, while layers $l \geq 4$ yield $S_l < 0.08$. This establishes that layers 0–3 encode nearly all of the model's stable, token-level sentiment signal. Early layers are also highly position-specific: $\sigma_1 = 0.147$ (mean, $p < 0.001$), $\sigma_0 = 0.132$, $\sigma_2 = 0.128$, $\sigma_3 = 0.101$, while off-target patching produces near-zero shifts ($\Delta_l(q) \approx 0$), confirming that sentiment signals are sharply localized within these activations. Context independence is quantified by $\mathrm{Var}(\Delta_l)$: for $l = 0$–$3$ the mean variance is $0.038$, compared to $0.356$ for $l = 4$–$11$, indicating that early-layer sentiment features are context-agnostic.

| Layer | Sensitivity $S_l$ | Specificity $\sigma_l$ | Context $\mathrm{Var}(\Delta_l)$ |
|-------|-------------------|------------------------|----------------------------------|
| 0     | 0.182             | 0.132                  | Low ($\sim 0.038$)               |
| 1     | 0.164             | 0.147                  | Low                              |
| 2     | 0.143             | 0.128                  | Low                              |
| 3     | 0.118             | 0.101                  | Low                              |

Patching early layers can shift the model's sentiment output by up to 18 percentage points, demonstrating their dominant causal role.

4. Contextual Integration and Falsification of Middle-Layer Hypotheses

Canonical theories proposed a "Middle Layer Concentration", whereby contextual phenomena such as negation, sarcasm, or domain shift would be integrated primarily in layers 4–8 and distributed across dedicated regions (the phenomenon-specificity and distributed-processing hypotheses). Systematic lexical ablation falsifies these hypotheses: contextual integration is not localized to mid-network layers; instead, context-dependent effects emerge only in the late layers (8–11) through a unified, non-modular mechanism. Patching experiments on a suite of 8,000 contextually modified sentences show that only late-layer interventions alter the response to context-sensitive sentiment, whereas middle-layer patching (layers 4–7) has minimal impact. This indicates a non-hierarchical, late-stage contextual integration process for sentiment computation in GPT-2 (Hatua, 7 Dec 2025).
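One plausible way to run this comparison, reusing hypothetical helpers in the spirit of the earlier sketches (a sentiment probe over final-layer activations, plus clean and patched forward passes): measure the mean absolute probe shift when patching each layer range on context-modified pairs.

```python
import numpy as np

def range_effect(context_pairs, layers, probe, run_clean, run_patched):
    """Mean |probe shift| over `layers` for (source, corrupted, position)
    pairs with contextual modifications (negation, sarcasm, ...).
    `probe`, `run_clean`, and `run_patched` are assumed helpers: a sentiment
    probe on final-layer activations, and forward passes without / with
    activation patching at a given layer and positions."""
    effects = []
    for s_text, c_text, pos in context_pairs:
        baseline = probe(run_clean(c_text))
        for l in layers:
            shifted = probe(run_patched(c_text, s_text, l, [pos]))
            effects.append(abs(shifted - baseline))
    return float(np.mean(effects))

# mid_effect  = range_effect(pairs, range(4, 8),  probe, run_clean, run_patched)
# late_effect = range_effect(pairs, range(8, 12), probe, run_clean, run_patched)
# The reported pattern: late_effect is substantially larger than mid_effect.
```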

5. Methodological Significance and Generalizability

Systematic lexical ablation provides direct, layer-resolved causal evidence for circuit-level specialization in neural LLMs. In GPT-2 (117M parameters), this approach demonstrates that lexical sentiment encoding is primarily early-layer, sharply localized, and context-invariant, in contrast to the predicted hierarchical organization. However, generalizability remains an open question: other model families (BERT, RoBERTa, GPT-3) or scaling regimes may exhibit distinct patterns. Moreover, the assumption that lexical information is localized at the token level may overlook signal distributed across finer substructures (e.g., attention heads, MLP pathways).

6. Limitations and Prospects for Extension

Several methodological limitations and future directions warrant consideration:

  • Reliance on a single linear probe constrains findings to a specific classifier; additional probing methods or full model fine-tuning may reveal further detail.
  • Token-level patching assumes all relevant signal is localized; circuit-level analyses targeting, for instance, specific attention heads or MLP neurons may provide more granular interpretability.
  • Zero-out ablations may induce unnatural distributional shifts; rescaling interventions that preserve activation norms could mitigate this (see the sketch after this list).
  • The context-independence test is limited to six frames; broader coverage, including figurative language or domain-adapted corpora, could test the robustness of early-layer lexical encoding.
  • Detailed mapping of within-layer microstructure (e.g., "circuit brush" analyses) may enable pruning or more targeted interpretability interventions.
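A minimal sketch of the norm-preserving rescaling idea from the third bullet, assuming a precomputed baseline activation (e.g., a dataset-mean vector) for the layer; this is an illustrative variant, not a method from the paper.

```python
import torch

def rescaled_ablation(acts, positions, baseline):
    """Replace ablated token activations with a baseline direction scaled
    to the original token's norm, instead of zeroing them outright.
    acts:      (T, D) activations at one layer
    positions: token indices to ablate
    baseline:  (D,) assumed dataset-mean activation for this layer
    """
    out = acts.clone()
    unit = baseline / baseline.norm()    # unit-norm baseline direction
    for p in positions:
        out[p] = unit * acts[p].norm()   # preserve the token's activation norm
    return out
```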

In aggregate, systematic lexical-ablation conclusively demonstrates that GPT-2 encodes word-level sentiment predominantly in layers 0–3, with these signals being sharply localized and context-invariant, while contextual integration occurs only in the late-stage transformer layers—contradicting the anticipated mid-layer hub model (Hatua, 7 Dec 2025).

References (1)
