
Curved Representation Space of Vision Transformers (2210.05742v2)

Published 11 Oct 2022 in cs.CV

Abstract: Neural networks with self-attention (a.k.a. Transformers) like ViT and Swin have emerged as a better alternative to traditional convolutional neural networks (CNNs). However, our understanding of how the new architecture works is still limited. In this paper, we focus on the phenomenon that Transformers show higher robustness against corruptions than CNNs while not being overconfident. This is contrary to the intuition that robustness increases with confidence. We resolve this contradiction by empirically investigating how the output of the penultimate layer moves in the representation space as the input data moves linearly within a small area. In particular, we show the following. (1) While CNNs exhibit a fairly linear relationship between the input and output movements, Transformers show a nonlinear relationship for some data. For such data, the output of a Transformer moves along a curved trajectory as the input moves linearly. (2) When a data point lies in a curved region, it is hard to move it out of the decision region, since the output moves along a curved trajectory instead of a straight line to the decision boundary, resulting in the high robustness of Transformers. (3) If a data point is slightly perturbed so that it jumps out of the curved region, the subsequent movements become linear and the output goes directly to the decision boundary. In other words, a decision boundary does exist near the data; it is merely hard to find because of the curved representation space. This explains the underconfident predictions of Transformers. We also examine mathematical properties of the attention operation that induce a nonlinear response to linear perturbation. Finally, we share additional findings regarding what contributes to the curved representation space of Transformers and how the curvedness evolves during training.
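The abstract's core measurement is a probe: move an input linearly along a fixed direction and track how the penultimate-layer features move in representation space, checking whether the feature trajectory stays straight or curves. Below is a minimal sketch of such a probe, not the authors' code: the model choice (timm's `vit_base_patch16_224`), the use of the class-token features as the "penultimate" representation, and the curvature measure (maximum deviation of the feature path from the straight chord between its endpoints) are all illustrative assumptions.

```python
# Sketch (not the paper's implementation): probe feature-path curvature
# under a small linear input perturbation. Assumes timm is installed.
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True)
model.eval()

def feature_path(x, direction, steps=9, eps=1e-2):
    """Class-token features along the linear input path x + t * direction."""
    ts = torch.linspace(0.0, eps, steps)
    with torch.no_grad():
        feats = [model.forward_features(x + t * direction) for t in ts]
    # forward_features returns token embeddings (B, tokens, dim);
    # take the class token as a stand-in for the penultimate output.
    return torch.stack([f[:, 0] for f in feats])  # (steps, B, dim)

def curvature_score(feats):
    """Max deviation of the feature path from the straight chord between
    its endpoints, normalized by chord length; ~0 means a linear response."""
    a, b = feats[0], feats[-1]
    chord = b - a
    # Project each intermediate point onto the chord, measure the residual.
    t = ((feats - a) * chord).sum(-1, keepdim=True) / chord.pow(2).sum(-1, keepdim=True)
    residual = feats - (a + t * chord)
    return residual.norm(dim=-1).max() / chord.norm(dim=-1)

x = torch.randn(1, 3, 224, 224)           # stand-in for a real image
direction = torch.randn_like(x)
direction = direction / direction.norm()  # unit perturbation direction
print(curvature_score(feature_path(x, direction)))
```

Under the paper's account, a near-zero score corresponds to CNN-like linear behavior, while a large score flags the curved regions in which Transformer features travel a bent path toward the decision boundary.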

