Low-latency vision transformers via large-scale multi-head attention (2506.23832v1)

Published 30 Jun 2025 in cs.CV

Abstract: The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label is explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to several distinct vision transformer (ViT) architectures that achieve the same accuracy but differ in their LS-MHA structures. As a result, their soft committee yields superior accuracy, an outcome not typically observed in CNNs, which rely on hundreds of filters. In addition, a significant reduction in latency is achieved without affecting accuracy by replacing the initial transformer blocks with convolutional layers. This substitution accelerates early-stage learning, which is then improved by subsequent transformer layers. The extension of this learning mechanism to natural language processing tasks, based on quantitative differences between CNN and ViT architectures, has the potential to yield new insights in deep learning. The findings are demonstrated using compact convolutional transformer architectures trained on the CIFAR-100 dataset.
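
The two architectural ideas the abstract makes concrete, replacing the initial transformer blocks with a convolutional stem and combining several such models as a soft committee of averaged softmax outputs, can be sketched as follows. This is a minimal PyTorch illustration under assumed hyperparameters (embedding width, head count, block depth, mean-token pooling); it is not the paper's exact compact convolutional transformer configuration.

```python
# Hedged sketch: a compact convolutional-transformer classifier for CIFAR-100.
# All hyperparameters (channel widths, number of heads/blocks, pooling) are
# illustrative assumptions, not the configuration reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvTokenizer(nn.Module):
    """Convolutional stem standing in for the initial transformer blocks."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),   # 32x32 -> 16x16
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),   # 16x16 -> 8x8
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)                        # (B, C, 8, 8)
        return x.flatten(2).transpose(1, 2)     # (B, 64 tokens, C)


class CompactConvTransformer(nn.Module):
    """Conv tokenizer followed by a short stack of transformer encoder blocks."""
    def __init__(self, num_classes: int = 100, embed_dim: int = 256,
                 num_heads: int = 4, num_blocks: int = 6):
        super().__init__()
        self.tokenizer = ConvTokenizer(embed_dim)
        block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=2 * embed_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=num_blocks)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.blocks(self.tokenizer(x))
        return self.head(tokens.mean(dim=1))    # mean pooling over tokens


def soft_committee(models, x: torch.Tensor) -> torch.Tensor:
    """Average the softmax outputs of several independently trained models."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)


if __name__ == "__main__":
    imgs = torch.randn(8, 3, 32, 32)                 # CIFAR-100-sized dummy batch
    committee = [CompactConvTransformer() for _ in range(3)]
    print(soft_committee(committee, imgs).shape)     # torch.Size([8, 100])
```

The committee here averages class probabilities across models whose transformer stacks would, per the abstract, differ in their LS-MHA structure; in practice each member would be trained separately before the outputs are combined.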
