Discrete Representations Strengthen Vision Transformer Robustness (2111.10493v2)

Published 20 Nov 2021 in cs.CV

Abstract: Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition. While recent studies suggest that ViTs are more robust than their convolutional counterparts, our experiments find that ViTs trained on ImageNet are overly reliant on local textures and fail to make adequate use of shape information. ViTs thus have difficulty generalizing to out-of-distribution, real-world data. To address this deficiency, we present a simple and effective architectural modification to ViT's input layer: adding discrete tokens produced by a vector-quantized encoder. Unlike the standard continuous pixel tokens, discrete tokens are invariant under small perturbations and individually contain less information, which encourages ViTs to learn global information that is invariant. Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks while maintaining performance on ImageNet.
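The input-layer modification the abstract describes can be illustrated with a minimal NumPy sketch: image patches are embedded both as continuous pixel tokens and as discrete tokens looked up via nearest-neighbor vector quantization against a codebook, and the two token sequences are concatenated before entering the transformer. This is an illustrative assumption-laden toy, not the paper's implementation (the paper uses a learned vector-quantized encoder, and the patch size, codebook, and projection matrices below are placeholders):

```python
import numpy as np

def patchify(img, p):
    # Split an HxWxC image into non-overlapping pxp patches,
    # each flattened to a vector of length p*p*C.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def vq_tokens(patches, codebook):
    # Assign each patch the index of its nearest codebook entry
    # (squared Euclidean distance) -> discrete token ids.
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def build_input(img, p, codebook, W_pix, E_disc):
    # Continuous pixel tokens: linear projection of raw patches.
    patches = patchify(img, p)
    pixel_tok = patches @ W_pix
    # Discrete tokens: embedding lookup of quantized patch ids.
    ids = vq_tokens(patches, codebook)
    disc_tok = E_disc[ids]
    # Concatenate both token sequences as the transformer input.
    return np.concatenate([pixel_tok, disc_tok], axis=0)
```

Because each discrete token collapses a patch to a single codebook index, small pixel perturbations that do not change the nearest codebook entry leave the discrete token sequence unchanged, which is the invariance the abstract appeals to.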

Authors (6)
  1. Chengzhi Mao (38 papers)
  2. Lu Jiang (90 papers)
  3. Mostafa Dehghani (64 papers)
  4. Carl Vondrick (93 papers)
  5. Rahul Sukthankar (39 papers)
  6. Irfan Essa (91 papers)
Citations (39)