Per-Token Susceptibility Analysis
- Per-token susceptibility analysis is a framework that quantifies token-level sensitivity using a susceptibility matrix inspired by statistical physics.
- It employs UMAP for dimensionality reduction to visually uncover emergent structures like the induction circuit and spacing fin in neural networks.
- The framework enables functional attribution of subnetworks by tracking developmental dynamics across training stages, aiding model interpretability and diagnosis.
Per-token susceptibility analysis is a framework for quantifying and interpreting the sensitivity of a neural network’s behavior, particularly in LLMs and Transformer-based architectures, to perturbations associated with individual tokens within an input sequence. Drawing from concepts in statistical physics, such as linear response and susceptibility, this paradigm enables the dissection of model internals at a granular, token-specific level. It exposes how changes in network components or data distributions affect loss and functional structure, informs interpretability, and illuminates the developmental “anatomy” of deep learning systems.
1. Mathematical Foundations and Susceptibility Matrix
The central mathematical construct is the susceptibility matrix, which encodes the per-token response of network components to infinitesimal perturbations. For a token-context pair (x, y), a susceptibility vector is defined as

$$\chi(x, y) = \big(\chi_{C_1}(x, y), \ldots, \chi_{C_H}(x, y)\big) \in \mathbb{R}^{H},$$

where H is the number of components (typically attention heads), and each entry

$$\chi_C(x, y) = \beta \,\mathrm{Cov}_{w \sim p_\beta}\!\big[\varphi_C(w),\; \ell(x, y; w) - L(w)\big]$$

expresses the covariance, over posterior weight draws parameterized by an inverse temperature β, between a generalized observable φ_C associated with component C and the difference between the per-token loss ℓ(x, y; w) and the population loss L(w).
This matrix construction is rooted in a Bayesian statistical-mechanical perspective. The model parameters w are distributed according to the tempered posterior

$$p_\beta(w) \propto \exp\!\big(-n\beta L_n(w)\big)\,\pi(w),$$

where L_n(w) is the empirical loss over n samples, π(w) is a prior over parameters, and β is the inverse temperature. The susceptibility captures the network's linear response: the leading-order change in the posterior expectation of any observable φ under a small, controlled distributional perturbation.
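In practice, χ can be estimated by Monte Carlo over posterior draws. Below is a minimal sketch of such an estimator, assuming weight samples (e.g., from an SGLD sampler) have already been evaluated into arrays of observables and losses; all array names and shapes are illustrative assumptions rather than the source's API.

```python
import numpy as np

def susceptibility_matrix(phi_samples, token_loss, pop_loss, beta):
    """Estimate chi[t, c] = beta * Cov_w(phi_c(w), l_t(w) - L(w)).

    phi_samples : (S, H) observable phi_C per posterior draw and component
    token_loss  : (S, T) per-token loss l(x, y; w) per draw
    pop_loss    : (S,)   population loss L(w) per draw
    beta        : inverse temperature of the tempered posterior
    """
    delta = token_loss - pop_loss[:, None]          # per-token loss deviations
    phi_c = phi_samples - phi_samples.mean(axis=0)  # center observables
    d_c = delta - delta.mean(axis=0)                # center deviations
    S = phi_samples.shape[0]
    return beta * (d_c.T @ phi_c) / (S - 1)         # (T, H) sample covariances
```

The (T, H) output is the matrix whose rows are the per-token susceptibility vectors χ(x, y).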
2. Visualization and Dimensionality Reduction Using UMAP
Given the high dimensionality of the susceptibility matrix, it is visualized using Uniform Manifold Approximation and Projection (UMAP). Prior to projection, each column (component/attention head) is standardized to zero mean and unit variance. UMAP parameters such as the number of neighbors and minimum distance are tuned (e.g., n_neighbors = 45, min_dist = 0.1) to preserve local geometric structure.
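A minimal sketch of this projection step, assuming `chi` is a (T × H) per-token susceptibility matrix (e.g., as estimated by the earlier snippet) and that the `umap-learn` package is installed; the placeholder data is purely illustrative.

```python
import numpy as np
import umap  # pip install umap-learn

# Placeholder susceptibility matrix: T token-context pairs x H components.
# In practice this would come from an estimator like the one sketched above.
chi = np.random.default_rng(0).normal(size=(5000, 64))

# Standardize each column (component) to zero mean and unit variance.
chi_std = (chi - chi.mean(axis=0)) / chi.std(axis=0)

# Parameter values follow the text; they trade off local vs. global structure.
reducer = umap.UMAP(n_neighbors=45, min_dist=0.1, n_components=2,
                    random_state=0)
embedding = reducer.fit_transform(chi_std)  # (T, 2) layout of the "body plan"
```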
The resulting two-dimensional embedding yields a “rainbow serpent” configuration where distinct classes of token sequences (e.g., word starts, word ends, numerics) occupy stratified regions. The axes of the UMAP projection often align with principal components of the susceptibility data:
- PC1 typically differentiates suppression versus expression across components (posterior–anterior offset).
- PC2 captures dorsal–ventral variation, often reflecting the width or separation of distinct functional pathways.
Such embeddings allow one to observe macroscopic anatomical organization and track the emergence of structural features during training.
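As a rough check on this alignment, one can compare the UMAP axes against the leading principal components of the standardized susceptibilities. The sketch below continues from the previous snippet (`chi_std`, `embedding`) and is an assumed diagnostic, not a procedure from the source.

```python
import numpy as np
from sklearn.decomposition import PCA

# PC1/PC2 scores per token-context pair of the standardized matrix.
pca = PCA(n_components=2)
pcs = pca.fit_transform(chi_std)
print(pca.explained_variance_ratio_)

# Correlate each UMAP axis with each PC to check the alignment noted above.
for i in range(2):
    for j in range(2):
        r = np.corrcoef(embedding[:, i], pcs[:, j])[0, 1]
        print(f"UMAP axis {i} vs PC{j + 1}: r = {r:+.2f}")
```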
3. Emergent Structural Features and Body Plan
A key discovery is the “body plan”—a macroscopic, spatial organization in the UMAP plot representing the gene-expression-like profile of token sequences. Distinct regions correspond to functional circuits:
- Induction Circuit: Manifests as a thickening along the dorsal–ventral direction (the PC2 axis), linked to patterns in which rare bigrams repeat (the canonical "induction head" behavior of predicting repeated sequences from earlier context). The circuit's emergence can be observed as this structure becomes pronounced at later training stages.
- Stratification by Token Class: Word starts, word ends, numerics, and other syntactic categories align along specific regions of the projection.
This structural order is not static; embeddings at early stages of training are unstructured but acquire progressively more anatomical coherence as learning proceeds.
4. Discovery of Novel Structures: The Spacing Fin
Beyond validating known mechanisms, per-token susceptibility analysis enables the identification of previously uncharted functional modules. The “spacing fin” is one such emergent structure:
- Description: The spacing fin comprises a cluster in UMAP space corresponding to tokens y that are spacing characters (spaces, newlines, tabs), often following contexts ending in a sequence of similar tokens.
- Significance: Its presence suggests the model develops specialized subcircuits, possibly for accurately counting or differentiating runs of spacing tokens, a function that is both frequent and important in language modeling yet previously lacked explicit recognition.
The spacing fin’s migration, separation, and ultimate integration into the main body plan (as visualized across training stages) provide a dynamic perspective on the differentiation of functional specialization.
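A minimal sketch of how the fin might be located and tracked, assuming `embedding` is the 2-D layout from the UMAP step and `tokens` is a list of target tokens y aligned row-for-row with it; both names and the separation score are illustrative assumptions.

```python
import numpy as np

# Illustrative set of spacing-token strings (not exhaustive).
SPACING = {" ", "\n", "\t", "  ", "\n\n"}

def fin_separation(embedding, tokens):
    """Distance between the spacing-token cluster and the rest of the body plan."""
    is_space = np.array([t in SPACING for t in tokens])
    fin, body = embedding[is_space], embedding[~is_space]
    # Centroid distance as a crude separation score; evaluating this at each
    # checkpoint traces the fin's migration, separation, and reattachment.
    return np.linalg.norm(fin.mean(axis=0) - body.mean(axis=0))
```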
5. Developmental Dynamics and Training Progression
The developmental approach, inspired by embryology, divides training into multiple chronological stages (e.g., LM1–LM5), tracking the transformation of susceptibility structure:
- Early-stage susceptibility distributions are largely undifferentiated.
- Intermediate stages exhibit rapid migration of token classes in susceptibility space, reflecting the acquisition of representational specialization.
- The induction circuit becomes apparent as variance along PC2 increases sharply for induction-pattern tokens.
- The spacing fin initially separates from the main body plan and later reattaches, highlighting a non-monotonic trajectory of functional compartmentalization.
Across random seeds, these macro-scale developmental changes are robust and reproducible, indicating they are induced by architectural and training-data properties rather than stochastic initialization effects.
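A sketch of this staged analysis, assuming a hypothetical loader `load_susceptibility_matrix` that returns the per-stage (T × H) matrix and a precomputed boolean `induction_mask` flagging induction-pattern rows; neither is from the source.

```python
import numpy as np

stages = ["LM1", "LM2", "LM3", "LM4", "LM5"]

for stage in stages:
    chi = load_susceptibility_matrix(stage)  # hypothetical helper, (T, H)
    chi_std = (chi - chi.mean(axis=0)) / chi.std(axis=0)
    # PC2 of the standardized matrix (columns are already zero-mean).
    _, _, vt = np.linalg.svd(chi_std, full_matrices=False)
    pc2_scores = chi_std @ vt[1]
    # Rising PC2 variance for induction-pattern rows signals the circuit's
    # emergence; `induction_mask` is an assumed precomputed boolean array.
    print(stage, "PC2 var (induction tokens):", pc2_scores[induction_mask].var())
```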
6. Consequences for Model Analysis and Architecture
The susceptibility matrix, and its accompanying manifold structure, provides a holistic and quantitative basis for:
- Functional Attribution: Assigning computational roles to subnetworks, circuits, or components based on their per-token susceptibility profiles.
- Generalization Analysis: Monitoring how internal structure evolves in concert with capability, informing the relationship between model anatomy and task performance.
- Circuit Discovery: Enabling unsupervised identification and categorization of mechanisms (e.g., induction, counting, token segmentation) without requiring ablations or external guidance; a minimal clustering sketch follows this list.
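One way to operationalize circuit discovery is to cluster the per-token susceptibility profiles directly. The sketch below assumes `chi_std` and `tokens` as in the earlier snippets; the cluster count is an arbitrary illustrative choice, not a value from the source.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster per-token susceptibility profiles into candidate functional groups.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(chi_std)
labels = kmeans.labels_

# Inspect a few target tokens per cluster to assign tentative roles
# (induction, counting, segmentation, ...); `tokens` is assumed aligned
# with the rows of chi_std.
for k in range(8):
    idx = np.flatnonzero(labels == k)[:5]
    print(f"cluster {k}:", [tokens[i] for i in idx])
```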
This suggests that susceptibility analysis, through direct visualization of the model’s internal developmental “anatomy,” may inform principled approaches to model interpretability, diagnosis of misgeneralization, and the rational design of network architectures based on observed organizational principles and emergent functional modules.