On the Surprising Effectiveness of Attention Transfer for Vision Transformers (2411.09702v1)

Published 14 Nov 2024 in cs.LG, cs.AI, cs.CV, and cs.NE

Abstract: Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher also further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.

Authors (5)
  1. Alexander C. Li (10 papers)
  2. Yuandong Tian (128 papers)
  3. Beidi Chen (61 papers)
  4. Deepak Pathak (91 papers)
  5. Xinlei Chen (106 papers)

Summary

Analyzing the Effectiveness of Attention Transfer in Vision Transformers

The paper "On the Surprising Effectiveness of Attention Transfer for Vision Transformers" by Alexander C. Li et al. brings forth an intriguing new perspective on the traditional utilization of Vision Transformers (ViTs). The research challenges the prevailing notion that pre-training benefits arise from feature representations, introducing instead the concept of attention transfer—where the attention patterns alone from pre-trained models can enable comparable performance in downstream tasks.

Key Contributions and Findings

The research proposes and systematically evaluates the idea of attention transfer in Vision Transformers, presenting two primary methods: Attention Copy and Attention Distillation.

  1. Attention Copy: The student network adopts the attention patterns of a pre-trained teacher while learning its own features. Surprisingly, this recaptures a large share of the pre-training benefit, reaching 85.1% top-1 accuracy on ImageNet-1K and closing 77.8% of the gap between training from scratch and full weight fine-tuning.
  2. Attention Distillation: The student learns to emulate the teacher's attention patterns through an explicit distillation loss. This matches the 85.7% accuracy of weight fine-tuning, making it a viable substitute when fine-tuning is impractical. A minimal sketch of both transfer modes follows this list.
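
To make the two modes concrete, here is a minimal PyTorch-style sketch, assuming attention maps of shape (batch, heads, tokens, tokens) whose rows sum to 1. The names `AttentionCopyBlock` and `attention_distillation_loss` are illustrative and not taken from the paper's released code.

```python
import torch
import torch.nn as nn


class AttentionCopyBlock(nn.Module):
    """Student attention layer that reuses a teacher's attention map.

    The student learns only the value/output projections; the mixing
    pattern (which tokens attend to which) is taken from the teacher.
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim); teacher_attn: (B, heads, N, N), rows summing to 1.
        B, N, _ = x.shape
        v = self.v_proj(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Aggregate the student's own values using the teacher's attention pattern.
        out = teacher_attn.detach() @ v                  # (B, heads, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, -1)      # (B, N, dim)
        return self.out_proj(out)


def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
    """KL divergence between teacher and student attention distributions,
    averaged over batch, heads, and query tokens."""
    log_s = torch.log(student_attn + eps)
    log_t = torch.log(teacher_attn + eps)
    return (teacher_attn * (log_t - log_s)).sum(dim=-1).mean()
```

As we understand the two modes, Attention Copy keeps the frozen teacher running alongside the student to supply `teacher_attn` at every layer (including at inference), while Attention Distillation adds the loss above to the task loss and lets the student use its own attention maps at test time.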

The paper also analyzes in detail how much of the attention information is actually needed. Notably, transferring only a subset of layers or heads still retains much of the benefit, suggesting that the critical information is concentrated in the attention of later layers and in subsets of heads; a rough sketch of such partial transfer follows.
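
Building on the sketch above (and reusing its `attention_distillation_loss` and the `torch` import), a partial transfer could restrict the loss to chosen layers and heads. The names `layer_ids` and `head_mask` are hypothetical stand-ins for that selection, not terms from the paper.

```python
def partial_attention_distillation(student_attns, teacher_attns,
                                   layer_ids, head_mask):
    """Average the distillation loss over selected layers and heads only.

    student_attns, teacher_attns: per-layer lists of (B, heads, N, N) tensors.
    layer_ids: indices of the layers whose attention is transferred (non-empty).
    head_mask: boolean tensor of shape (heads,) selecting which heads to match.
    """
    losses = [
        attention_distillation_loss(student_attns[i][:, head_mask],
                                    teacher_attns[i][:, head_mask])
        for i in layer_ids
    ]
    return torch.stack(losses).mean()
```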

Implications

These findings suggest that the common assumption attributing pre-training's benefits solely to learned features needs reevaluation. Decoupling attention from features shows that the structured way information flows between tokens, captured by the learned attention patterns, plays a pivotal role. This invites practitioners to explore lighter-weight and potentially more robust alternatives to transferring full pre-trained weights, opening alternative paths toward scalable deployment of ViTs.

Future Prospects

Natural extensions include verifying this decoupling of attention from features across other domains and model scales, which would clarify how well the approach generalizes. Such transfer methods might also motivate new architectures that reuse readily available attention patterns, expanding the space of trainable models at lower computational cost.

In conclusion, the paper shifts the understanding of ViT pre-training by showing how much of its benefit is carried by attention patterns alone. It opens avenues for rethinking efficient training strategies and for extending attention transfer beyond vision to other domains and architectures.