Theoretical Justification of the “Optimal” Scrambling Range

Establish a rigorous theoretical justification for the empirically observed “optimal” range of the Information Scrambling Index in Vision Transformers trained on ImageNet-1k classification. The Information Scrambling Index at layer ℓ is defined as the normalized all-to-all reconstruction performance minus the normalized self-only reconstruction performance, where both scores measure how well the pre–positional-encoding patch embeddings can be recovered from layer-ℓ token representations. In addition, clarify how this range depends on the auxiliary diagnostics Attention Consensus Index (ACI) and CLS Centrality (CCC), which are used to characterize communication and hub marginalization.

Background

The paper introduces the Information Scrambling Index as a layer-wise diagnostic of global versus local token interactions in Vision Transformers. It is computed as the difference between two normalized reconstruction metrics that attempt to recover pre–positional-encoding patch embeddings from current layer representations: one using only the token itself (self-only) and one using all tokens (all-to-all). The authors empirically identify three regimes—communication collapse, controlled consensus, and chaotic diffusion—correlated with model performance and final-layer geometric quality (Neural Collapse metrics).
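The definition above can be written compactly as follows. The symbols are notation assumed here for illustration and are not taken from the paper: S_ℓ denotes the Scrambling Index at layer ℓ, and the two R terms denote the normalized all-to-all and self-only reconstruction scores. The normalization itself is left unspecified, as in the text.

```latex
% Assumed notation (not the paper's):
%   S_\ell                          : Information Scrambling Index at layer \ell
%   \widehat{R}^{\mathrm{all}}_\ell  : normalized all-to-all reconstruction score
%   \widehat{R}^{\mathrm{self}}_\ell : normalized self-only reconstruction score
\[
  S_\ell \;=\; \widehat{R}^{\mathrm{all}}_\ell \;-\; \widehat{R}^{\mathrm{self}}_\ell
\]
```

Under this notation, S_ℓ > 0 means that reconstruction from all tokens outperforms reconstruction from the token alone at layer ℓ, and S_ℓ = 0 means the two scores coincide.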

Within this framework, the authors report an empirically observed low-positive band of the Scrambling Index in ViT-B (approximately 0.004 to 0.009, under their normalization) that aligns with better geometry and performance. They also compute attention-graph diagnostics, including the Attention Consensus Index (ACI) and CLS Centrality (CCC), and note that efficient models tend to marginalize the [CLS] hub while maintaining controlled mixing. Despite these empirical trends, they explicitly state that a full theoretical justification for the “optimal” scrambling range and its dependence on these proxies remains open.
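As a minimal sketch of how the reported band could be checked, the snippet below takes per-layer normalized reconstruction scores, forms their difference, and flags layers whose index falls between 0.004 and 0.009. The function names and the input scores are hypothetical. Only the band endpoints come from the text above, and they are approximate and specific to ViT-B under the authors' normalization.

```python
# Approximate band reported for ViT-B in the text above (authors' normalization).
BAND_LOW, BAND_HIGH = 0.004, 0.009


def scrambling_index(all_to_all_score, self_only_score):
    """Difference of the two normalized reconstruction scores for one layer."""
    return all_to_all_score - self_only_score


def in_reported_band(index, low=BAND_LOW, high=BAND_HIGH):
    """True if the index lies inside the reported band (endpoints included)."""
    return low <= index <= high


def summarize_layers(all_to_all_scores, self_only_scores):
    """Return (layer, index, in_band) tuples, one per layer.

    Both inputs are hypothetical per-layer score lists; how the scores are
    produced and normalized is not specified here.
    """
    if len(all_to_all_scores) != len(self_only_scores):
        raise ValueError("score lists must have the same length")
    rows = []
    for layer, (a, s) in enumerate(zip(all_to_all_scores, self_only_scores)):
        idx = scrambling_index(a, s)
        rows.append((layer, idx, in_reported_band(idx)))
    return rows


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    for layer, idx, ok in summarize_layers([0.50, 0.60, 0.90], [0.50, 0.594, 0.80]):
        print(f"layer {layer}: index={idx:.4f} in_band={ok}")
```

The check says nothing about which regime an out-of-band layer belongs to, and it does not involve ACI or CCC. It only reports whether a value falls inside the empirically observed range.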

References

The “optimal” scrambling range we report is empirically observed in this setting and depends on our proxies (InfoX, ACI, CCC); a full theoretical justification is still open.

— Mechanisms of Non-Monotonic Scaling in Vision Transformers (2511.21635 - Kumar, 26 Nov 2025) in Limitations (subsection of Section 6: Discussion)
