- The paper shows that only large weights avert layer collapse in deep self-attention networks, highlighting a critical limitation of small weight regimes.
- It employs a rigorous perturbation analysis to show that, under small weights, a multi-layer self-attention network effectively reduces to a single layer.
- The findings imply that low-rank approximations and strict weight clipping can impair model expressivity, urging a reevaluation of acceleration techniques.
Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse
This paper rigorously analyzes the representational limitations of self-attention networks, particularly transformers, under constraints on the magnitude of their weight matrices. The central claim is that large weights are necessary to prevent a phenomenon termed "layer collapse," and that skip (residual) connections alone are insufficient to maintain the expressive power of deep attention-based architectures.
Context and Motivation
The quadratic computational complexity of attention mechanisms in transformers has motivated a substantial body of work on algorithmic acceleration, often leveraging low-rank approximations or kernel methods. Prior results have established that, under the assumption of small (bounded) weights, attention can be approximated in nearly linear time, but that this is not possible for arbitrary weights unless major complexity-theoretic conjectures are violated.
A parallel line of research, notably Dong, Cordonnier, and Loukas (2021), introduced the concept of rank collapse: in deep self-attention networks with small weights and no skip connections, the output rapidly converges to a rank-1 matrix, severely limiting representational capacity. Their work suggested that skip connections are essential to avoid this collapse, a view widely adopted in the literature.
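As a point of reference, Dong et al.'s rank-collapse statement is usually phrased via a residual that measures the distance from the nearest rank-1 matrix with identical rows; the sketch below follows that convention, with the unspecified norm left as an assumption.

```latex
% Residual of X with respect to its best rank-1 approximation with identical rows
% (following Dong et al., 2021; the choice of norm is left unspecified here).
\[
  \mathrm{res}(X) \;=\; X - \mathbf{1}\,x^{\top},
  \qquad
  x \;=\; \arg\min_{x}\, \bigl\lVert X - \mathbf{1}\,x^{\top} \bigr\rVert .
\]
% Rank collapse: for a deep attention-only network with small weights and no skip
% connections, \lVert \mathrm{res}(\cdot) \rVert of the output shrinks rapidly with
% depth, i.e. the output approaches a rank-1 matrix.
```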
Main Contributions
This paper challenges the prevailing interpretation by introducing the notion of layer collapse: even with skip connections, if the weights are small, a deep self-attention network can be well-approximated by a single-layer network. The key results are:
- Theorem 1.2 (Informal): For any L-layer self-attention network with skip connections and weight matrices bounded in ℓ∞ norm by η, there exists a single-layer network whose outputs are within O(η) (in entrywise norm) of the original network on any input (a hedged formalization follows this list).
- Implication: Skip connections do not prevent the collapse of representational power when weights are small; only large weights can do so.
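A formal reading of the informal statement, with the caveat that the exact constant and norm conventions below are assumptions rather than the paper's precise formulation:

```latex
% Layer collapse (paraphrase of informal Theorem 1.2; constants are assumptions).
% F_L: an L-layer self-attention network with skip connections whose weight
% matrices W all satisfy \lVert W \rVert_\infty \le \eta.
% Claim: there exists a single-layer network G such that, for every input X,
\[
  \max_{i,j}\, \bigl| \bigl( F_L(X) - G(X) \bigr)_{ij} \bigr| \;\le\; C \cdot \eta ,
\]
% where C depends on the input norm and architectural parameters (width, number
% of heads) but, as noted later in this summary, not on the depth L.
```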
The analysis is constructive: when the weights are small, the attention blocks of the lower layers can be removed at the cost of only a small perturbation to the output, iteratively reducing the network to a single effective layer. A toy numerical illustration of this reduction follows.
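The NumPy toy below is purely illustrative and is not the paper's construction: the single-head architecture, the weight bound η, and the choice to compare the deep stack against one attention layer applied to the raw input are all assumptions made for demonstration.

```python
# Illustrative toy (not the paper's construction): with small weights, a deep
# attention+skip stack stays entrywise close to a single attention layer.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_layer(X, Wq, Wk, Wv):
    """Single-head self-attention with a skip connection."""
    d = X.shape[1]
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))  # row-stochastic attention matrix
    return X + A @ (X @ Wv)                          # skip connection + attention output

n, d, L, eta = 16, 8, 12, 1e-3   # sequence length, width, depth, weight bound (assumed values)
X = rng.normal(size=(n, d))
layers = [tuple(rng.uniform(-eta, eta, size=(d, d)) for _ in range(3)) for _ in range(L)]

Y_deep = X
for Wq, Wk, Wv in layers:                # full L-layer network
    Y_deep = attn_layer(Y_deep, Wq, Wk, Wv)

Y_single = attn_layer(X, *layers[-1])    # one layer applied directly to the raw input

print(f"max entrywise gap: {np.abs(Y_deep - Y_single).max():.2e} (eta = {eta:.0e})")
# The gap is controlled by eta; the paper's theorem quantifies this for the
# single-layer network it constructs, which need not be the one used here.
```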
Technical Approach
The proof strategy is based on a careful perturbation analysis of the softmax attention mechanism and its composition across layers. The authors introduce a variant of the "Res" function to formalize the notion of removing constant (rank-1) components from the output, and establish Lipschitz-type bounds for the effect of small perturbations on the softmax and attention outputs.
Key technical steps include:
- Perturbation Lemmas: Quantitative bounds on how the softmax and related functions change under small input perturbations, leveraging the exponential structure of softmax (a quick numeric sanity check follows this list).
- Layer-wise Reduction: Demonstrating that, under small weights, the output of each attention layer is only a small perturbation of its input, allowing for the removal of layers with bounded error accumulation.
- Extension to Multi-head Attention: The results are generalized to the multi-head setting, with explicit dependence on the number of heads and layers.
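For the perturbation-lemma step, the snippet below is a minimal numeric sanity check, not the paper's lemma: it only measures, over random trials, how much the softmax output can move per unit of ℓ∞ input perturbation.

```python
# Minimal numeric sanity check (not the paper's lemma): small perturbations of the
# softmax input move the output by at most a comparably small amount.
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, eps = 32, 1e-4                # vector length and perturbation size (assumed values)
worst_ratio = 0.0
for _ in range(1000):
    x = rng.normal(size=d)
    delta = rng.uniform(-eps, eps, size=d)
    change = np.abs(softmax(x + delta) - softmax(x)).max()
    worst_ratio = max(worst_ratio, change / np.abs(delta).max())

print(f"worst observed ratio ||softmax(x+delta)-softmax(x)||_inf / ||delta||_inf = {worst_ratio:.3f}")
# The ratio stays bounded by a small constant, in line with the Lipschitz-type
# behaviour the paper quantifies (its exact constants depend on the setting).
```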
Quantitative and Theoretical Strength
The results are strong in that they hold for arbitrary depth and width, as long as the weight bound is enforced. The error bound is explicit and scales linearly with the weight bound and input norm, independent of the number of layers. This is a significant strengthening over prior work, which only established rank collapse in the absence of skip connections.
The paper also clarifies that the parameter regime for collapse is more stringent in their analysis: their η (an ℓ∞ bound) can be much smaller than the ℓ1 bound used in previous work, so their result applies in settings where prior results do not.
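If both bounds are read as entrywise matrix norms (an assumption here; the summary does not pin down the norm conventions), the two regimes relate through an elementary inequality:

```latex
% Elementary relation between entrywise norms of W \in \mathbb{R}^{d \times d}
% (assumed norm conventions; not taken from the paper).
\[
  \lVert W \rVert_\infty = \max_{i,j} |W_{ij}|
  \;\le\;
  \lVert W \rVert_1 = \sum_{i,j} |W_{ij}|
  \;\le\;
  d^2 \,\lVert W \rVert_\infty ,
\]
% so a matrix obeying the entrywise bound \lVert W \rVert_\infty \le \eta may still
% have an \ell_1 norm as large as d^2 \eta, i.e. it can fall outside a small-\ell_1 regime.
```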
Implications
Practical
- Model Design: The findings suggest that attempts to accelerate transformers via low-rank or kernel approximations that require small weights will necessarily limit the model's expressiveness, regardless of architectural modifications such as skip connections.
- Parameter Scaling: To maintain the full expressive power of deep attention networks, it is essential to allow for large weights, even at the cost of quadratic computational complexity.
- Regularization: Over-regularization or aggressive weight clipping may inadvertently induce layer collapse, reducing the effective depth of the model (a hypothetical clipping sketch follows this list).
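As a hypothetical illustration only: the parameter names and the clipping helper below are assumptions, not an API or recommendation from the paper; the point is that a per-entry clip at η puts every attention weight into exactly the small-weight regime the analysis flags.

```python
# Hypothetical sketch: aggressive per-entry weight clipping after each optimizer step.
# Clipping every attention weight into [-eta, eta] enforces the entrywise bound under
# which the paper predicts layer collapse. Module/parameter names are assumptions.
import torch

def clip_attention_weights(model: torch.nn.Module, eta: float = 1e-3) -> None:
    with torch.no_grad():
        for name, param in model.named_parameters():
            # assumed naming convention for query/key/value projections
            if any(tag in name for tag in ("q_proj", "k_proj", "v_proj")):
                param.clamp_(-eta, eta)
```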
Theoretical
- Expressivity-Complexity Tradeoff: There is a fundamental tradeoff between computational efficiency (enabled by small weights and low-rank approximations) and representational power in attention-based models.
- Reinterpretation of Skip Connections: The role of skip connections is not to prevent collapse per se, but to allow the identity function to be preserved; they do not guarantee deep expressivity under small weights.
Future Directions
- Alternative Acceleration: The results motivate the search for new acceleration techniques that do not rely on small weights, or for architectures that can maintain expressivity under different forms of parameterization.
- Norm Regimes: The analysis is primarily in the ℓ∞ norm; extensions to ℓ1 or ℓ2 norms may yield further insights into the parameter regimes for collapse.
- Generalization to Other Architectures: The phenomenon of layer collapse under small weights may extend to other deep architectures with compositional nonlinearities.
Conclusion
This work provides a rigorous and quantitative foundation for the necessity of large weights in deep self-attention networks, independent of skip connections. It refines our understanding of the expressivity of transformers and sets clear limitations on the use of low-rank and kernel-based acceleration methods. The results have direct implications for both the theory and practice of large-scale neural network design, particularly in the context of efficient and expressive LLMs.