Systematic Lexical-Ablation Experiments
- The paper demonstrates that early GPT-2 layers (0–3) act as dedicated lexical sentiment detectors, with activation patching shifting the model's sentiment output by up to 18 percentage points.
- It employs a linear logistic probe on final-layer activations to quantify sensitivity, specificity, and context variance, confirming sharply localized lexical signals.
- The study falsifies mid-layer contextual integration theories by showing that late transformer layers (8–11) alone integrate broader contextual cues.
Systematic lexical-ablation experiments refer to a class of mechanistic interpretability techniques that probe the causal contribution of individual lexical items (such as sentiment-bearing words) within large language models. These experiments apply controlled interventions to a model's hidden-state activations at prescribed layers and positions, with the goal of isolating how and where lexical information is encoded, processed, and integrated with wider context. In GPT-2, systematic lexical ablation (operationalized by activation patching) reveals that the early transformer layers (0–3) serve as dedicated lexical sentiment detectors, encoding localized, position-specific, and context-independent polarity signals, while contextual integration mechanisms appear only in the late layers (8–11) (Hatua, 7 Dec 2025).
1. Mathematical Formalism and Implementation
Systematic lexical-ablation in GPT-2 exploits activation patching, a targeted intervention method defined over the hidden-state matrices at each transformer layer. For any layer $\ell$, the activations for a source sentence and a corrupted sentence are denoted $H_\ell^{\mathrm{src}}, H_\ell^{\mathrm{corr}} \in \mathbb{R}^{T \times d}$, where $T$ is the number of tokens and $d$ the hidden dimension. A binary mask $M \in \{0,1\}^{T \times 1}$ selects the ablation/patching positions. The patched activation is

$$H_\ell^{\mathrm{patched}} = M \odot H_\ell^{\mathrm{src}} + (1 - M) \odot H_\ell^{\mathrm{corr}},$$

where $\odot$ denotes element-wise multiplication broadcast over $d$. Zeroing out a lexical signal in $H_\ell$ is achieved by

$$H_\ell^{\mathrm{zero}} = (1 - M) \odot H_\ell.$$

In practice, the model runs a forward pass up to layer $\ell$ using the corrupted sentence. At layer $\ell$, $H_\ell^{\mathrm{corr}}$ is replaced by $H_\ell^{\mathrm{patched}}$ (or $H_\ell^{\mathrm{zero}}$), and the forward pass continues to the final layer. This isolates the causal effect of a specific token's representation at a given layer on the model's output.
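A minimal sketch of this patching procedure, assuming the Hugging Face `transformers` GPT-2 implementation; the sentence pair, target layer, and patch position are illustrative, and the paper's exact harness may differ:

```python
# Minimal activation-patching sketch for GPT-2 via Hugging Face transformers.
# The sentence pair, target layer, and patch position are illustrative.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

src = tokenizer("The movie was wonderful", return_tensors="pt")
corr = tokenizer("The movie was terrible", return_tensors="pt")

layer = 2      # layer ell at which to intervene
position = 3   # token index of the sentiment word (the mask M)

# 1) Cache the source activations H_ell^src.
with torch.no_grad():
    src_hidden = model(**src, output_hidden_states=True).hidden_states
H_src = src_hidden[layer + 1]  # hidden_states[0] is the embedding output

# 2) Re-run the corrupted sentence, overwriting the masked position with the
#    cached source activation: H^patched = M*H^src + (1-M)*H^corr.
#    (For zero-ablation, assign zeros instead of H_src.)
def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, position, :] = H_src[:, position, :]
    return output

handle = model.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corr, output_hidden_states=True)
handle.remove()

final_acts = patched.hidden_states[-1]  # input to the sentiment probe
```

Hooking `model.h[layer]` replaces that block's output in place, so the patched activation propagates through every subsequent layer, matching the formalism above.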
2. Experimental Setup and Key Variables
In the canonical setup, a linear logistic-regression probe is trained (without updating the base model) on final-layer (layer 11) activations to classify sentiment (positive vs. negative) with 95% validation accuracy, evaluated on 5,000 held-out sentences. Lexical testing is conducted on 1,000 (source, corrupted) sentence pairs constructed by single-word substitutions (positive to negative and vice versa), across six distinct sentence contexts. For each pair and for each layer $\ell$, the shift in sentiment prediction after patching is

$$\Delta_\ell = p(\text{positive} \mid \text{patched at } \ell) - p(\text{positive} \mid \text{corrupted}),$$

where $p$ is the probe output. The absolute value $|\Delta_\ell|$ quantifies the "sensitivity" or causal-effect magnitude for layer $\ell$. To quantify position specificity, patching is applied both at the true sentiment position $t^*$ and at random off-target positions $t' \neq t^*$ (averaged over 2,000 off-target patches), yielding a specificity score $S_\ell = |\Delta_\ell(t^*)| - \overline{|\Delta_\ell(t')|}$. Context independence is measured as the variance of $\Delta_\ell$ across the six contexts. Together, these metrics provide a multi-faceted profile of how lexical sentiment features are processed at each model layer.
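These metrics reduce to simple arithmetic over probe outputs. A minimal sketch, assuming a trained scikit-learn logistic probe and hypothetical helpers `run_patched(layer, position)` / `run_corrupted()` that return the final-layer activation vector after the corresponding forward pass; the specificity formula follows the reconstruction above:

```python
# Layerwise sensitivity, position specificity, and context variance from
# probe outputs. `probe` is a trained sklearn LogisticRegression over
# final-layer activations; `run_patched(layer, position)` and
# `run_corrupted()` are hypothetical helpers returning the final-layer
# activation vector after the corresponding forward pass.
import numpy as np

def p_positive(probe, activation):
    """Probe output p: P(positive) for one final-layer activation vector."""
    return probe.predict_proba(activation.reshape(1, -1))[0, 1]

def sensitivity(probe, run_patched, run_corrupted, layer, position):
    """|Delta_ell| = |p(pos | patched at ell) - p(pos | corrupted)|."""
    delta = (p_positive(probe, run_patched(layer, position))
             - p_positive(probe, run_corrupted()))
    return abs(delta)

def specificity(probe, run_patched, run_corrupted, layer, t_star, off_positions):
    """On-target effect minus mean off-target effect (reconstructed S_ell)."""
    on_target = sensitivity(probe, run_patched, run_corrupted, layer, t_star)
    off_target = np.mean([sensitivity(probe, run_patched, run_corrupted, layer, t)
                          for t in off_positions])
    return on_target - off_target

def context_variance(deltas_across_contexts):
    """Var(Delta_ell) over the six sentence contexts."""
    return float(np.var(list(deltas_across_contexts)))
```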
3. Layerwise Results: Localization and Context Invariance
Experimental results demonstrate "Early Layer Dominance" in lexical sentiment encoding. Mean layerwise sensitivity for layers 0–3 is $|\Delta_0| = 0.182$, $|\Delta_1| = 0.164$, $|\Delta_2| = 0.143$, and $|\Delta_3| = 0.118$, while later layers yield markedly smaller shifts. This establishes that layers 0–3 encode nearly all of the model's stable, token-level sentiment signal. Early layers are also highly position-specific, with specificity scores $S_0 = 0.132$, $S_1 = 0.147$, $S_2 = 0.128$, and $S_3 = 0.101$; off-target patching yields only negligible shifts, confirming that sentiment signals are sharply localized within these activations. Context independence is quantified by $\mathrm{Var}(\Delta_\ell)$ across contexts: for layers 0–3, the mean variance is $0.038$, compared to $0.356$ for layers 8–11, indicating that early-layer sentiment features are context-agnostic.
| Layer | Sensitivity $\lvert\Delta_\ell\rvert$ | Specificity $S_\ell$ | Context Variance |
|---|---|---|---|
| 0 | 0.182 | 0.132 | Low (0.038) |
| 1 | 0.164 | 0.147 | Low |
| 2 | 0.143 | 0.128 | Low |
| 3 | 0.118 | 0.101 | Low |
Patching early layers can shift the model’s output by up to 18 percentage points, demonstrating their dominant causal role.
4. Contextual Integration and Falsification of Middle-Layer Hypotheses
Canonical theories proposed a "Middle Layer Concentration," whereby contextual phenomena such as negation, sarcasm, or domain shift would be integrated primarily in layers 4–8 and distributed across specific regions (phenomenon specificity, distributed processing). Systematic lexical ablation falsifies these hypotheses: contextual integration is not localized to mid-network layers; instead, context-dependent effects emerge only in late layers (8–11) through a unified, non-modular mechanism. Patching experiments on a suite of 8,000 contextually modified sentences show that only late-layer interventions alter the response to context-sensitive sentiment, whereas middle-layer patching (layers 4–7) has minimal impact. This indicates a non-hierarchical, late-stage contextual integration process for sentiment computation in GPT-2 (Hatua, 7 Dec 2025).
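The mid- versus late-layer comparison can be expressed as a simple aggregate over layer groups. A sketch under the same assumptions as above, with `delta(layer, pair)` a hypothetical wrapper returning $\Delta_\ell$ for one sentence pair:

```python
# Sketch: contrast mid-layer (4-7) and late-layer (8-11) patching effects on
# context-modified sentence pairs (negation, sarcasm, domain shift).
# `delta(layer, pair)` is a hypothetical wrapper returning Delta_ell for
# one (source, corrupted) pair.
import numpy as np

MID_LAYERS = range(4, 8)
LATE_LAYERS = range(8, 12)

def mean_effect(layers, pairs, delta):
    """Mean |Delta_ell| over a layer group and a set of sentence pairs."""
    return float(np.mean([[abs(delta(l, p)) for p in pairs] for l in layers]))

# Under the paper's finding, for context-sensitive pairs:
#   mean_effect(MID_LAYERS, context_pairs, delta)  ->  near zero
#   mean_effect(LATE_LAYERS, context_pairs, delta) ->  substantial
```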
5. Methodological Significance and Generalizability
Systematic lexical-ablation provides direct, layer-resolved causal evidence for the circuit-level specialization of neural LLMs. In GPT-2 (117M parameters), this approach demonstrates that lexical sentiment encoding is primarily early-layer, sharply localized, and context invariant, in contrast to predicted hierarchical architectures. However, generalizability remains an open question: alternative model families (BERT, RoBERTa, GPT-3) or scaling regimes may exhibit distinct patterns. Moreover, the assumption that lexical information is localized at the token level may omit relevant signal decomposability within substructures (e.g., attention heads, MLP pathways).
6. Limitations and Prospects for Extension
Several methodological limitations and future directions warrant consideration:
- Reliance on a single linear probe constrains findings to a specific classifier; additional probing methods or full model fine-tuning may reveal further detail.
- Token-level patching assumes all relevant signal is localized; circuit-level analyses targeting, for instance, specific attention heads or MLP neurons may provide more granular interpretability.
- Zero-out ablations may induce unnatural distributional shifts; rescaling interventions could mitigate this by preserving activation norms (see the sketch after this list).
- The context-independence test is limited to six frames; broader coverage, including figurative language or domain-adapted corpora, could test the robustness of early-layer lexical encoding.
- Detailed mapping of within-layer microstructure (e.g., "circuit brush" analyses) may enable pruning or more targeted interpretability interventions.
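As a concrete illustration of the rescaling idea from the third bullet, a minimal sketch of a norm-preserving ablation; the dataset-mean reference direction is an illustrative assumption, not the paper's method:

```python
# Norm-preserving ablation: rather than zeroing a token's activation (which
# shifts the layer's activation statistics), replace it with a reference
# direction rescaled to the original L2 norm. The dataset-mean reference is
# an illustrative choice, not the paper's method.
import torch

def rescaled_ablation(hidden: torch.Tensor, position: int,
                      reference: torch.Tensor) -> torch.Tensor:
    """Replace hidden[:, position] with `reference`, preserving its norm.

    hidden:    (batch, seq_len, d) activations at one layer
    position:  token index to ablate
    reference: (d,) replacement direction, e.g. a dataset-mean activation
    """
    patched = hidden.clone()
    orig_norm = hidden[:, position, :].norm(dim=-1, keepdim=True)  # (batch, 1)
    direction = reference / reference.norm().clamp_min(1e-8)       # unit vector
    patched[:, position, :] = direction * orig_norm                # (batch, d)
    return patched
```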
In aggregate, systematic lexical-ablation conclusively demonstrates that GPT-2 encodes word-level sentiment predominantly in layers 0–3, with these signals being sharply localized and context-invariant, while contextual integration occurs only in the late-stage transformer layers—contradicting the anticipated mid-layer hub model (Hatua, 7 Dec 2025).