- The paper presents a geometric analysis of representational collapse in deep Transformers and its remedy via manifold-constrained updates.
- It introduces two mechanisms: Manifold-Constrained Hyper-Connections (mHC) for tangent-space projection of residual updates, and Deep Delta Learning (DDL) for dynamic gating and erasure.
- Experiments show enhanced feature diversity, stable scaling up to 200 layers, and improved perplexity on language modeling tasks.
Motivation and Problem Statement
The paper "Geometric and Dynamic Scaling in Deep Transformers" (2601.01014) addresses the persistent degeneracy in very deep Transformer networks, specifically the phenomenon where representations progressively lose diversity, suffer rank collapse, and become severely redundant. Contrary to the standard narrative attributing this failure to optimization issues such as vanishing gradients, the work posits a fundamentally geometric cause: unrestricted residual updates drive representations off their intrinsic data manifolds and monotonically accumulate noise, undermining feature expressiveness and model scalability.
Theoretical Contributions
The key thesis is that standard residual connections, which employ unconstrained vector addition in Euclidean space, are ill-suited for deep architectures whose states are better characterized as evolving on lower-dimensional, nonlinear semantic manifolds. The authors advance a formalization in which network updates should:
- Respect Geometric Validity: Updates must lie within the local tangent space of the data manifold at every layer, ensuring the trajectory remains semantically meaningful.
- Allow Dynamic Traversal: The architecture should be able not only to accumulate new information, but also to erase or reflect away redundant or outdated features in a data-dependent manner.
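Taken together, these requirements suggest a residual step of the following form. This is a minimal sketch in assumed notation (none of these symbols are fixed by the paper's summary): $x_l$ is the layer-$l$ state, $F_l$ the layer's transformation, $P_{T_{x_l}\mathcal{M}}$ the projection onto the tangent space of the data manifold $\mathcal{M}$ at $x_l$, and $\beta_l$ a data-dependent signed step size.

```latex
% Sketch: a residual step satisfying both requirements (assumed notation)
x_{l+1} \;=\; x_l \;+\; \beta_l \, P_{T_{x_l}\mathcal{M}}\!\bigl(F_l(x_l)\bigr),
\qquad \beta_l \in \mathbb{R}
```

Here the projection $P_{T_{x_l}\mathcal{M}}$ enforces geometric validity, while the data-dependent scalar $\beta_l$ enables dynamic traversal: $\beta_l > 0$ accumulates, $\beta_l = 0$ leaves the state unchanged, and $\beta_l < 0$ erases.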
To operationalize these principles, the authors propose the Manifold-Geometric Transformer (MGT), underpinned by two main components:
- Manifold-Constrained Hyper-Connections (mHC): A generalized geometric projection operator that regularizes updates by projecting them onto an approximation of the manifold tangent space, suppressing noise-prone directions and curbing off-manifold drift.
- Deep Delta Learning (DDL): A dynamic gating mechanism with learnable reflection and erasure capability, formulated using a generalized Householder update. This structure decouples the direction and sign of updates, explicitly enabling negative (erasure) or zero (identity) update steps in addition to standard (accumulative) progression.
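One plausible way to write the two operators, consistent with the descriptions above but not taken from the paper: $\Delta_l$ is the raw update produced by the mixers, $U_l$ a learned basis approximating the tangent space at $x_l$, $k_l$ a unit direction, and $\alpha_l, \beta_l$ data-dependent scalars; all of these symbols and parameterizations are assumptions of this summary.

```latex
% mHC (sketch): rectify the raw update by projecting it onto an
% approximation of the local tangent space, here a learned low-rank basis U_l.
\tilde{\Delta}_l \;=\; \Pi_l \,\Delta_l,
\qquad
\Pi_l \;\approx\; U_l U_l^{\top} \;\approx\; P_{T_{x_l}\mathcal{M}}

% DDL (sketch): gated update plus an adaptive, Householder-style subtraction
% of existing content along a learned unit direction k_l.
x_{l+1} \;=\; x_l \;+\; \beta_l\,\tilde{\Delta}_l \;-\; \alpha_l\,(k_l^{\top} x_l)\,k_l,
\qquad \beta_l \in [-1,1],\;\; \alpha_l \ge 0
```

With $\beta_l = 0$ and $\alpha_l = 2$ the second equation reduces to a classical Householder reflection of $x_l$ about the hyperplane orthogonal to $k_l$; $\beta_l = \alpha_l = 0$ gives the identity, and the sign of $\beta_l$ switches between accumulating and erasing the rectified update.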
Architectural Innovations
The MGT block deviates from canonical Transformer designs by inserting an explicit geometric processing phase between feature generation and feature propagation. Each block performs the following sequence:
- Feature Generation through LayerNorm and Mixers (MHSA/FFN), yielding a raw update vector.
- Geometric Rectification by mHC projection, which soft-constrains the update direction to the approximated manifold tangent space.
- Delta Dynamics via the DDL controller, which computes a dynamic gating scalar from the current context, enabling explicit erasure (through sign reversal) of existing state features.
- Generalized Householder Update that incorporates both the rectified update and an adaptive subtraction proportional to the existing feature state.
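The sequence above admits a compact PyTorch-style sketch. This is an illustration of the described mechanics, not the paper's implementation: the tangent-space approximation (a learned low-rank basis), the gate parameterization (tanh/sigmoid readouts), and all names (`MGTBlock`, `tangent_basis`, `gate_proj`, `alpha_proj`, `dir_proj`) are assumptions of this summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MGTBlock(nn.Module):
    """Sketch of one MGT-style block: mixer -> mHC projection -> DDL gate -> update.

    Everything here is illustrative: the tangent space is approximated by a
    learned rank-r basis, and the gate/direction are simple linear readouts.
    The FFN path (omitted) would be treated the same way as the attention path.
    """

    def __init__(self, d_model: int, n_heads: int = 8, rank: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # mHC (sketch): learned low-rank basis approximating the local tangent space.
        self.tangent_basis = nn.Parameter(torch.randn(d_model, rank) / d_model ** 0.5)
        # DDL (sketch): signed gate beta, erasure scale alpha, erasure direction k.
        self.gate_proj = nn.Linear(d_model, 1)
        self.alpha_proj = nn.Linear(d_model, 1)
        self.dir_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        # 1) Feature generation: LayerNorm + mixer yield the raw update.
        h = self.norm(x)
        delta, _ = self.mixer(h, h, h, need_weights=False)

        # 2) Geometric rectification (mHC sketch): keep only the component of the
        #    raw update that lies in the span of the learned basis.
        Q, _ = torch.linalg.qr(self.tangent_basis)      # orthonormal (d_model, rank)
        delta_t = (delta @ Q) @ Q.T

        # 3) Delta dynamics (DDL sketch): signed gate in [-1, 1], non-negative
        #    erasure scale, and a unit erasure direction, all state-dependent.
        beta = torch.tanh(self.gate_proj(h))            # (batch, seq, 1)
        alpha = torch.sigmoid(self.alpha_proj(h))       # (batch, seq, 1)
        k = F.normalize(self.dir_proj(h), dim=-1)       # (batch, seq, d_model)

        # 4) Generalized Householder-style update: add the gated, rectified update
        #    and subtract existing content along k, proportional to (k^T x).
        erase = (x * k).sum(dim=-1, keepdim=True) * k
        return x + beta * delta_t - alpha * erase


# Usage sketch: one block over a toy batch.
block = MGTBlock(d_model=512, n_heads=8, rank=64)
y = block(torch.randn(2, 16, 512))                      # (2, 16, 512)
```

Stacking such blocks (and applying the same treatment to the FFN path) gives a model of the kind evaluated in the depth-scaling experiments below; the basis rank and gate parameterization are the natural knobs for the ablations.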
The synergy of geometric and dynamic controls is theoretically justified, and the model is constructed to allow isolated examination of each mechanism's effect, supporting rigorous ablation-based analysis.
Experimental Protocols and Results
A comprehensive suite of five experiments systematically tests MGT's hypotheses and isolates its empirical behavior:
- Rank Evolution Analysis shows that while standard Transformers display monotonic rank decay (i.e., feature collapse) with increasing depth (up to 100 layers), MGT maintains a normalized effective rank well above 0.5, indicating stable feature diversity even at extreme depths (a sketch of one common effective-rank computation follows this list).
- Ablation Study confirms that mHC (geometry) and DDL (dynamics) provide complementary benefits, with their joint application yielding more than the sum of the individual effects.
- Beta Distribution Analysis of the Gate Parameter reveals that DDL transitions from feature accumulation in early layers (E[β] > 0) to erasure in deeper layers (a high fraction of β < 0), consistent with the theoretical argument that active semantic refinement is required as representations propagate.
- Depth Scaling Experiments demonstrate favorable perplexity scaling for MGT compared to vanilla Transformers, with robust training stability and convergence even at 200 layers and matched parameter budgets.
- Language modeling experiments on WikiText-103 and OpenWebText confirm consistent perplexity gains and improved training dynamics. The parameter overhead introduced by mHC and DDL is moderate (~25%) relative to the observed benefits in expressiveness and trainability.
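For concreteness, the normalized effective rank tracked in the rank-evolution experiment can be computed as sketched below. This uses the entropy-based effective rank (exponential of the entropy of the normalized singular-value spectrum) divided by the maximum attainable rank, which is one common choice; the paper's exact normalization may differ.

```python
import torch


def normalized_effective_rank(hidden: torch.Tensor, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (tokens x d_model) representation matrix,
    normalized to [0, 1] by the maximum attainable rank.

    This is one standard definition; the paper may normalize differently.
    """
    # Center tokens so a constant offset (a rank-1 component) does not inflate the estimate.
    h = hidden - hidden.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(h)                  # singular values, descending
    p = s / (s.sum() + eps)                      # normalize to a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    eff_rank = torch.exp(entropy)                # effective rank in [1, min(n, d)]
    return (eff_rank / min(h.shape)).item()


# Example (hypothetical names): track feature diversity across layers.
# states = [out.reshape(-1, d_model) for out in all_layer_outputs]
# ranks = [normalized_effective_rank(h) for h in states]
```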
The findings recast the scaling limits of Transformer architectures as primarily geometric, rather than purely a matter of optimization or hardware constraints. By enforcing local geometric validity and robust dynamic control, MGT demonstrates that deep stacking per se is not inherently pathological, provided updates are semantically and structurally regularized. This perspective goes beyond simple normalization or initialization heuristics and suggests that future ultra-deep models should treat hidden-state propagation as controlled navigation on data manifolds.
In practical terms, the modular structure allows independent tuning and analysis of the geometric and dynamic components, and the underlying principles are compatible with other sequence architectures and domain-specific variants. The explicit erasure mechanism also has theoretical implications: it parallels memory management strategies in classical recurrent networks (e.g., LSTM forget gates), but is now grounded in manifold geometry.
Future Directions
Future research may extend the manifold constraints to richer, adaptive manifold classes, integrate MGT into multimodal or autoregressive architectures, or further explore the dynamics of information erasure in continual and lifelong learning scenarios. Open questions remain regarding the efficiency of manifold estimation in high-dimensional settings and the interplay of geometric priors with other forms of inductive bias (e.g., symmetry, causality).
Conclusion
This work presents a formal geometric account of representational collapse in deep Transformers and proposes the Manifold-Geometric Transformer (MGT) as a theoretically grounded and empirically validated solution. By decomposing residual updates into orthogonal geometric and dynamic operations, MGT achieves robust signal propagation, erasure, and expressivity in ultra-deep stacks, reframing the design of scalable large neural networks from a geometric control perspective. These insights offer a foundation for architectures that approach or exceed present depth limits without succumbing to degeneracy.