On the Geometric Structure of Layer Updates in Deep Language Models

Published 2 Apr 2026 in cs.LG, cs.AI, and cs.CL | (2604.02459v1)

Abstract: We study the geometric structure of layer updates in deep LLMs. Rather than analyzing what information is encoded in intermediate representations, we ask how representations change from one layer to the next. We show that layerwise updates admit a decomposition into a dominant tokenwise component and a residual that is not captured by restricted tokenwise function classes. Across multiple architectures, including Transformers and state-space models, we find that the full layer update is almost perfectly aligned with the tokenwise component, while the residual exhibits substantially weaker alignment, larger angular deviation, and significantly lower projection onto the dominant tokenwise subspace. This indicates that the residual is not merely a small correction, but a geometrically distinct component of the transformation. This geometric separation has functional consequences: approximation error under the restricted tokenwise model is strongly associated with output perturbation, with Spearman correlations often exceeding 0.7 and reaching up to 0.95 in larger models. Together, these results suggest that most layerwise updates behave like structured reparameterizations along a dominant direction, while functionally significant computation is concentrated in a geometrically distinct residual component. Our framework provides a simple, architecture-agnostic method for probing the geometric and functional structure of layer updates in modern LLMs.

Abstract PDF Upgrade to Chat

Authors (1)

Jun-Sik Yoo

Summary

The paper introduces a decomposition method separating layer updates into dominant tokenwise transformations and non-local residuals.
It shows that tokenwise updates capture most energy through low-dimensional approximations, while residuals display distinct geometric properties that significantly alter outputs.
Evaluations across architectures demonstrate that model size and function class constraints impact approximation accuracy and the functional relevance of the residual components.

Geometric Structure of Layer Updates in Deep LLMs

Introduction

The paper "On the Geometric Structure of Layer Updates in Deep LLMs" (2604.02459) presents a functional-geometric analysis of how hidden representations are updated from one layer to the next in sequence models such as Transformers and state-space architectures. Departing from representation-centered interpretability paradigms, this work focuses on the structure of transformations mediating transitions between successive hidden states, offering a lens on model dynamics that is agnostic to architecture details.

Framework: Decomposition via Tokenwise Structure and Residuals

The central contribution is a decomposition of layerwise updates into a dominant tokenwise component and a residual term. Specifically, each transition $h_{\ell+1} = T(h_\ell) + r(h_\ell)$ is split into:

Tokenwise Transformation $T$ : An input-conditioned, per-token linear map acting locally on each token. Concretely, $T(x_i) = A(x_i)x_i$ where $A(x_i)$ is a $d \times d$ matrix smoothly varying with the token vector.
Residual $r(h_\ell)$ : The remainder that cannot be captured within the constraints of the tokenwise function class.

This decomposition is operationalized by fitting $T$ locally (via $k$ -nearest neighbors in the representation space) for every token, harnessing restricted classes such as low-rank or diagonal linear maps, with the residual interpreted as an empirical signal of what lies outside tokenwise locality.

Figure 1: The layerwise update is decomposed into a local linear tokenwise approximation and a residual representing non-local structure.

Geometric Analysis of Layer Updates

The analysis quantifies the geometric alignment between three update vectors per token:

Full update: $h_{\ell+1} - h_\ell$
Tokenwise update: $T(h_\ell) - h_\ell$
Residual: $T$ 0

Strong alignment is found between the full update and the tokenwise update across layers and models, with cosine similarities near 1 and small angular deviations. The residual, in contrast, displays substantial angular deviation and low alignment, implying it is a geometrically distinct component.

Figure 3: Distribution of alignment summaries for full versus tokenwise and residual versus tokenwise components, highlighting their geometric separation.

Projection analyses further demonstrate that the energy of the full and tokenwise updates is almost entirely recoverable in a low-dimensional subspace defined by $T$ 1, while the residual has low projection onto this space.

Figure 2: Fraction of energy of updates captured by leading singular vectors (top-1,4,8): residuals remain outside the dominant tokenwise subspace.

Sensitivity analyses show that model size, neighborhood size in local fitting, and the chosen function class (e.g. rank or diagonal constraints) impact the alignment and residual magnitudes. Larger models and higher-rank approximations yield lower errors, but the residual's geometric distinctiveness persists.

Figure 4: Effect of neighborhood size and rank on approximation error in Pythia-70M; higher locality and larger rank improve the fit but do not eliminate the residual.

Functional Consequences and Error-Output Alignment

The functional relevance of the decomposition is probed by intervening—replacing a layer update with its tokenwise approximation and measuring the impact on the model's output distribution. Output perturbation, quantified by KL divergence, is strongly correlated (Spearman $T$ 2 often $T$ 3, and up to $T$ 4 in large models) with the residual error across models and layers.

This robust finding demonstrates that the geometrically distinct residual component disproportionately contributes to changes in model behavior. Notably, poorly approximated tokens are systematically associated with higher output divergence.

Further, the alignment between residual magnitude and output perturbation varies across layers, pointing to structured regimes where tokenwise approximations either suffice or fail to capture functionally significant updates.

Architectural Trends and Function-Class Dependence

Empirical results across models (e.g. DistilGPT2, Pythia, and Mamba series) suggest both universality and architectural dependency:

Universality: The geometric separation and functional significance of residuals is observed in both Transformer and state-space architectures, implying a general property not tied to self-attention.
Architectural Dependence: Smaller models (e.g., DistilGPT2) can be well-approximated by diagonal transformations; larger models benefit from low-rank maps, implicating a richer but structured tokenwise dynamic.

Relation to prior work is established by contrasting this geometric-functional probe with representation-based interpretability methods (e.g., probing, Logit Lens), emphasizing the complementarity and richer understanding enabled by focusing on transformation geometry rather than information content.

Theoretical Implications and Future Directions

The decomposition framework conceptualizes most layerwise transitions as structured reparameterizations along a dominant local axis (the tokenwise component). The residual is consistently shown to:

Be geometrically non-trivial—neither a small correction nor noise, but situated outside the principal update subspace.
Be functionally critical, as reflected in its strong association with output changes.

The separation between approximable and ‘residual’ computation is not rigidly dictated by architectural choices, but emerges empirically from learning dynamics and the restriction of function classes.

From a theoretical perspective, these findings suggest:

Deep LLMs organize computation such that most of the update energy is tokenwise and low-dimensional, but essential, prediction-critical computation is embedded in non-local updates.
The chosen function class mediates interpretability: higher expressivity can absorb more structure, decreasing residual magnitude but potentially obfuscating functional correspondence.

Two principal open directions are highlighted:

Residual Structure Analysis: Further structural resolution of residuals—e.g., decomposing within-subspace vs. out-of-subspace interactions, or associating residuals with specific types of contextual computation.
Interpretability-Expressivity Trade-off: Understanding how increasing the function class’s complexity impacts both approximation quality and the interpretability/functionality alignment of the residual.

Conclusion

This work rigorously delineates the geometric and functional anatomy of layerwise updates in deep LLMs, establishing a framework that isolates a dominant tokenwise direction and a geometrically/behaviorally distinct residual. The findings demonstrate that crucial, prediction-changing computation preferentially resides in this residual, a fact that persists across architectures and model scales. This geometric perspective provides a new, architecture-agnostic tool for interpretability and could inform both mechanistic analysis and novel training paradigms that target functional structure in deep models.

Markdown Report Issue