Refiner: Refining Self-attention for Vision Transformers (2106.03714v1)

Published 7 Jun 2021 in cs.CV

Abstract: Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
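
To make the two ideas in the abstract concrete, here is a minimal PyTorch sketch of a refiner-style attention block: the multi-head attention maps are expanded to more heads with a 1x1 convolution across the head dimension, locally refined with a depthwise convolution, and then reduced back before aggregating values. This is an illustrative reconstruction from the abstract, not the authors' released code; names such as RefinedAttention, expansion, and kernel_size are assumptions, and details like softmax placement may differ from the paper.

```python
import torch
import torch.nn as nn

class RefinedAttention(nn.Module):
    """Sketch of refiner-style self-attention: expand the attention maps
    to a higher-dimensional (more heads) space, convolve them to augment
    local patterns, then reduce back to the original head count."""

    def __init__(self, dim, num_heads=8, expansion=3, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.expanded_heads = num_heads * expansion
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Attention expansion: linear projection over the head dimension,
        # implemented as a 1x1 conv treating heads as channels.
        self.expand = nn.Conv2d(num_heads, self.expanded_heads, kernel_size=1)
        # Local refinement: depthwise conv over each N x N attention map.
        self.refine = nn.Conv2d(self.expanded_heads, self.expanded_heads,
                                kernel_size=kernel_size,
                                padding=kernel_size // 2,
                                groups=self.expanded_heads)
        # Reduction back to the original number of heads.
        self.reduce = nn.Conv2d(self.expanded_heads, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Expand -> refine -> reduce the attention maps themselves.
        attn = self.reduce(self.refine(self.expand(attn)))
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

The depthwise convolution step is what the abstract calls distributed local attention: each attention map is smoothed with a learnable local kernel before the usual global aggregation of values.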

Authors (9)
  1. Daquan Zhou (47 papers)
  2. Yujun Shi (23 papers)
  3. Bingyi Kang (39 papers)
  4. Weihao Yu (36 papers)
  5. Zihang Jiang (28 papers)
  6. Yuan Li (393 papers)
  7. Xiaojie Jin (50 papers)
  8. Qibin Hou (82 papers)
  9. Jiashi Feng (295 papers)
Citations (55)
