Theoretical Justification of the “Optimal” Scrambling Range
Establish a rigorous theoretical justification for the empirically observed “optimal” range of the Information Scrambling Index in Vision Transformers trained on ImageNet-1k classification. The Information Scrambling Index at layer ℓ is defined as the difference between the normalized all-to-all reconstruction performance and the normalized self-only reconstruction performance, where both measure how well the pre–positional-encoding patch embeddings can be recovered from layer-ℓ token representations. Also clarify how this range depends on the auxiliary diagnostics, the Attention Consensus Index (ACI) and CLS Centrality (CCC), which are used to characterize communication and hub marginalization.
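As a compact restatement of the verbal definition, the index can be written as a single difference. The symbols below are notation assumed here for illustration, not taken from the source: \tilde{R}^{\mathrm{all}}_{\ell} stands for the normalized all-to-all reconstruction performance at layer ℓ, and \tilde{R}^{\mathrm{self}}_{\ell} for the normalized self-only reconstruction performance. The normalization scheme itself is not specified in this statement and is left abstract.

```latex
% Assumed notation (not from the source):
%   \tilde{R}^{\mathrm{all}}_{\ell}  : normalized all-to-all reconstruction performance at layer \ell
%   \tilde{R}^{\mathrm{self}}_{\ell} : normalized self-only reconstruction performance at layer \ell
\[
  \mathrm{ISI}_{\ell} \;=\; \tilde{R}^{\mathrm{all}}_{\ell} \;-\; \tilde{R}^{\mathrm{self}}_{\ell}
\]
% If, as an additional assumption, both normalized scores lie in [0, 1],
% then the index is bounded: -1 \le \mathrm{ISI}_{\ell} \le 1.
```

Under this reading, a larger value means a patch embedding is recovered better from all layer-ℓ tokens jointly than from its own token alone. The open question is why a particular intermediate range of this quantity would be optimal.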
References
The “optimal” scrambling range we report is empirically observed in this setting and depends on our proxies (InfoX, ACI, CCC); a full theoretical justification is still open.